Abstract

Optimization problems with rank constraints arise in many applications, including matrix
regression, structured PCA, matrix completion and matrix decomposition problems. An attractive
heuristic for solving such problems is to factorize the low-rank matrix, and to run projected
gradient descent on the nonconvex factorized optimization problem. The goal of this paper
is to provide a general theoretical framework for understanding when such methods work well,
and to characterize the nature of the resulting fixed point. We provide a simple set of conditions
under which projected gradient descent, when given a suitable initialization, converges geometrically
to a statistically useful solution. Our results are applicable even when the initial solution
is outside any region of local convexity, and even when the problem is globally concave. Working
in a non-asymptotic framework, we show that our conditions are satisfied for a wide range
of concrete models, including matrix regression, structured PCA, matrix completion with real
and quantized observations, matrix decomposition, and graph clustering problems. Simulation
results show excellent agreement with the theoretical predictions.


1 Introduction 


There are a variety of problems in statistics and machine learning that require estimating a matrix
that is assumed (or desired) to be low-rank. For high-dimensional problems, the low-rank property
is useful as a form of regularization, and can also lead to more interpretable results in scientific
settings. Low-rank matrix estimation can be formulated as a nonconvex optimization problem
involving a cost function, measuring the fit to the data, along with a rank constraint. Even when
the cost function is convex (such as in the ubiquitous case of least-squares fitting), solving a
rank-constrained problem can be computationally difficult, with many interesting special cases known to
have NP-hard complexity in the worst-case setting. However, statistical settings lead naturally to
random ensembles, in which context such complexity concerns have been assuaged to some extent
by the use of semidefinite programming (SDP) relaxations. These SDP relaxations are based on
replacing the nonconvex rank constraint with a convex constraint based on the trace/nuclear norm.
For many statistical ensembles of problems, among them multivariate regression, matrix completion
and matrix decomposition, such types of SDP relaxations have been shown to have near-optimal
performance (e.g., see the papers [54, 51, 50, 22, 42] and references therein). Although in theory
any SDP can be solved to $\epsilon$-accuracy in polynomial time [52], the associated computational cost is
often too high in practice. Letting $d$ denote the dimension of the matrix, it can be as high as $d^6$
using standard interior point methods; such a scaling is not practical for many real-world
applications involving high-dimensional matrices. More recent work has developed algorithms that
are specifically tailored to certain classes of SDPs; however, even such specialized algorithms require
at least $d^2$ time, since solving the SDP involves optimizing over the space of $d \times d$ matrices.

In practice, researchers often resort instead to heuristic methods that directly optimize over
the space of low-rank matrices, using iterative algorithms such as alternating minimization, power
iteration, expectation maximization (EM) and projected gradient descent. Letting $r$ denote the
rank, these factorized optimization problems live in an $O(rd)$ dimensional space, as opposed to the
$O(d^2)$ space of the original problem. Such heuristic methods are quite effective in practice for some
problems, but sometimes can also suffer from local optima. These intriguing phenomena motivate a
recent and evolving line of work on understanding such iterative methods in the low-rank space. As
we discuss in detail below, recent work has studied some of these algorithms in a number of specific
settings. A natural question then arises: is there a general theory for understanding when low-rank
iterative methods will succeed?

In this paper, we make progress on this general question by focusing on projected gradient
descent in the low-rank space. We characterize a general set of conditions that govern the computational
and statistical properties of the solutions, and then specialize this general theory to obtain
corollaries for a broad range of problems. In more detail, suppose that we write a rank-$r$ matrix
$M \in \mathbb{R}^{d \times d}$ in its factorized form $F \otimes F = F F^\top$, where $F \in \mathbb{R}^{d \times r}$, and consider projected gradient
descent methods in the variable $F$. The matrix quadratic form $F \otimes F$ makes the problem inherently
nonconvex, and in many cases, the problem is not even locally convex. Nevertheless, our theory
shows that given a suitable initialization, projected gradient descent converges geometrically to
a statistically useful solution, under conditions that are much more general than convexity. Our
results are applicable even when the initial solution is outside any region of local convexity, or
when the problem is globally concave. Each iteration of projected gradient descent typically takes
time that is linear in $dr$, the degrees of freedom of a low-rank matrix, as well as in the input size.
Therefore, by directly enforcing low-rankness, our method simultaneously achieves two goals: we
not only attain statistical consistency in the high-dimensional regime, but also gain computational
advantages over convex relaxation methods that lift the problem to the space of $d \times d$ matrices.

For this approach to be relevant, an equally important question is when the above conditions for
convergence are satisfied. We verify these conditions for a broad range of statistical and machine
learning problems, including matrix sensing, matrix completion in both its standard and one-bit
forms, sparse principal component analysis (SPCA), graph clustering, and matrix decomposition
or robust PCA. For each of these problems, we show that a suitable initialization can be obtained
efficiently using simple methods, and the projected gradient descent approach has sample complexity
and statistical error bounds that are comparable to (and sometimes better than) the best existing results
(which are often achieved by convex relaxation methods). Notably, our approach does not require
using fresh samples in each iteration (a heuristic known as sample splitting that is often used to
simplify analysis), nor does it involve the computation of multiple singular value decompositions
(SVDs).

Let us now put our contributions in a broader historical context. The seminal work in [12] studies
the problem of obtaining low-rank solutions to SDPs using gradient descent on the factor space.
Several subsequent papers aim to obtain rigorous guarantees for nonconvex gradient descent focused
on specific classes of matrix estimation problems. For instance, recent papers such as [66] study
exact recovery in the setting of noiseless matrix sensing (i.e., solving linear matrix equalities with
random designs). Focusing on the rank-one setting, De et al. study the noiseless matrix
completion problem, and a stochastic version of nonconvex gradient descent; they prove global
convergence with a constant success probability, assuming independence between the samples used
by each iteration. The recent manuscript [57] studies several variants of nonconvex gradient descent
algorithms, again for noiseless matrix completion. Another line of work [17] considers the phase
retrieval problem, which can be reformulated as recovering a rank-one ($r = 1$) matrix from random
quadratic measurements. The regularity conditions imposed in this work bear some similarity to
our conditions, but their validation requires a very different analysis. An attractive feature of
phase retrieval is that it is known to be locally convex around the global optimum under certain
settings.

The work in this paper develops a unified framework for analyzing the behavior of projected
gradient descent in application to low-rank estimation problems, covering many of the models
described above as well as various others. Our theory applies to matrices of arbitrary rank $r$, and is
framed in the statistical setting of noisy observations, allowing for noiseless observations as a special
case. When specialized to particular models, our framework yields a variety of corollaries providing
guarantees for concrete statistical models that have not been studied in the work above. Notably,
our general conditions do not depend on local convexity, and thus can be applied to models such
as sparse PCA and clustering in which no form of local convexity holds. (In fact, our results apply
even when the loss function is globally concave.) In addition, we impose only a natural gradient
smoothness condition that is much less restrictive than the vanishing gradient condition imposed
in other work. Thus, one of the main contributions of this paper is to illuminate the weakest known
conditions under which nonconvex gradient descent can succeed, which in turn allows for applications to
several problems that lack local convexity and vanishing gradients.

It is also worth noting that other types of algorithms for nonconvex problems have also been
analyzed, including alternating minimization [32, 34], EM algorithms [4, 63], power methods [33],
various hard-thresholding and singular value projection methods [9], gradient descent for
nonconvex regression and spectrally sparse recovery problems [47], as well as gradient descent
on Grassmannian manifolds [40, 65]. Finally, there is a large body of work on convex-optimization
based approaches to the concrete examples considered in this paper. We compare our statistical
guarantees with results of these types after the statements of each of our corollaries.


Notation: The $i$-th row and $j$-th column of a matrix $Z$ are denoted by $Z_{i\cdot}$ and $Z_{\cdot j}$, respectively.
The spectral norm $|||Z|||_{\mathrm{op}}$ is the largest singular value of $Z$. The nuclear norm $|||Z|||_{\mathrm{nuc}}$ is the sum
of the singular values of $Z$. For parameters $1 \le a, b \le \infty$ and a matrix $Z$, the $\ell_a/\ell_b$ norm of $Z$
is $|||Z|||_{b,a} = \big( \sum_i \|Z_{i\cdot}\|_b^a \big)^{1/a}$, that is, the $\ell_a$ norm of the vector of the $\ell_b$ norms of the rows. Special
cases include the Frobenius norm $|||Z|||_F = |||Z|||_{2,2}$, the elementwise $\ell_1$ norm $\|Z\|_1 = |||Z|||_{1,1}$ and the
elementwise $\ell_\infty$ norm $\|Z\|_\infty = |||Z|||_{\infty,\infty}$. For a convex set $\mathcal{T}$, we use $\Pi_{\mathcal{T}}$ to denote the Euclidean
projection onto $\mathcal{T}$.
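As a quick illustration of the mixed-norm notation, the following sketch evaluates $|||Z|||_{b,a}$ for a small matrix and checks the special cases listed above. The function names (`row_norm`, `mixed_norm`) and the example matrix are hypothetical and chosen only for this illustration.

```python
import math

def row_norm(row, b):
    """l_b norm of a single vector; b may be math.inf."""
    if b == math.inf:
        return max(abs(x) for x in row)
    return sum(abs(x) ** b for x in row) ** (1.0 / b)

def mixed_norm(Z, b, a):
    """|||Z|||_{b,a}: the l_a norm of the vector of row-wise l_b norms."""
    return row_norm([row_norm(row, b) for row in Z], a)

# Illustrative matrix with one all-zero row.
Z = [[3.0, -4.0],
     [0.0, 0.0],
     [1.0, 2.0]]

frobenius = mixed_norm(Z, 2, 2)                      # sqrt(9 + 16 + 1 + 4)
elementwise_l1 = mixed_norm(Z, 1, 1)                 # 3 + 4 + 1 + 2
elementwise_max = mixed_norm(Z, math.inf, math.inf)  # largest absolute entry
row_sparse_norm = mixed_norm(Z, 2, 1)                # 5 + 0 + sqrt(5)
```

The $(2,1)$ case sums the Euclidean norms of the rows, so an all-zero row contributes nothing to it.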


2 Background 

We begin by setting up the class of matrix estimators to be studied in this paper, and then providing
various concrete examples of specific models to which our general theory applies.

2.1 Matrix estimators in the factorized formulation

Letting $\mathcal{S}^{d \times d}$ denote the space of all symmetric $d$-dimensional matrices, this paper focuses on a
class of matrix estimators that take the following general form. For a given sample size $n \ge 1$, let
$\mathcal{L}_n : \mathcal{S}^{d \times d} \to \mathbb{R}$ be a cost function. It is a random function, since it depends (implicitly in our
notation) on the observed data, and the function value $\mathcal{L}_n(M)$ provides some measure of fit of the
matrix $M$ to the given data. For a given convex set $\mathcal{M} \subseteq \mathcal{S}^{d \times d}$, we then consider a minimization
problem of the form

$$\min_{M \in \mathcal{S}^{d \times d}} \mathcal{L}_n(M) \quad \text{such that } M \succeq 0 \text{ and } M \in \mathcal{M}. \qquad (1)$$

The goal of solving this optimization problem is to estimate some unknown target matrix $M^*$.
Typically, the target matrix is a (near-)minimizer of the population version of the program, that
is, a solution to the same constrained minimization problem with $\mathcal{L}_n$ replaced by its expectation
$\bar{\mathcal{L}}(M) = \mathbb{E}[\mathcal{L}_n(M)]$. However, our theory does not require that $M^*$ minimizes this quantity, nor
that the gradient $\nabla \bar{\mathcal{L}}(M^*)$ vanish.

In many cases, the matrix $M^*$ either has low rank, or can be well-approximated by a matrix
of low rank. Concretely, if the target matrix $M^*$ has rank $r < d$, then it can be written in the
outer product form $M^* = F^* \otimes F^*$ for some other matrix $F^* \in \mathbb{R}^{d \times r}$ with orthogonal columns.
This factorized representation motivates us to consider the function $\tilde{\mathcal{L}}_n(F) := \mathcal{L}_n(F \otimes F)$, and the
factorized formulation

$$\min_{F \in \mathbb{R}^{d \times r}} \tilde{\mathcal{L}}_n(F) \quad \text{such that } F \in \mathcal{F}, \qquad (2)$$

where $\mathcal{F}$ is some convex set that contains $F^*$, and for which the set $\{F \otimes F \mid F \in \mathcal{F}\}$ acts as
a surrogate for $\mathcal{M}$. Note that due to the factorized representation of the low-rank matrix, this
factorized program is (in general) nonconvex, and is typically so even if the original program (1) is
convex.

Nonetheless, we can apply a projected gradient descent method in order to compute an approximate
minimizer. For this particular problem, the projected gradient descent updates take the
form

$$F^{t+1} = \Pi_{\mathcal{F}}\big(F^t - \eta^t \nabla \tilde{\mathcal{L}}_n(F^t)\big), \qquad (3)$$

where $\eta^t > 0$ is a step size parameter, $\Pi_{\mathcal{F}}$ denotes the Euclidean projection onto the set $\mathcal{F}$, and
the gradient¹ is given by $\nabla \tilde{\mathcal{L}}_n(F) = \big[\nabla_M \mathcal{L}_n(F \otimes F) + (\nabla_M \mathcal{L}_n(F \otimes F))^\top\big] F$. The main goal of this
paper is to provide a general set of sufficient conditions under which, up to a statistical tolerance
term, the sequence $\{F^t\}_{t=0}^\infty$ converges to some $F^*$ such that $F^* \otimes F^* = M^*$.
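The update (3) can be sketched in a few lines. The code below is a minimal illustration rather than the paper's implementation: it assumes that the gradient with respect to $M$ is symmetric (so the factor gradient is $2 \nabla_M \mathcal{L}_n(F \otimes F) F$), uses a constant step size, and takes the toy cost $\mathcal{L}_n(M) = \frac{1}{2}|||M - Y|||_F^2$ with no constraint (the projection is the identity). All function and variable names are hypothetical.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def outer(F):
    """F (x) F = F F^T."""
    return matmul(F, [list(col) for col in zip(*F)])

def projected_gradient_descent(grad_M, project, F0, step, iters):
    """Iterate F <- project(F - step * 2 * grad_M(F F^T) F).

    Assumes grad_M returns a symmetric matrix, so that the gradient with
    respect to the factor F is 2 * grad_M(F F^T) F.
    """
    F = [row[:] for row in F0]
    for _ in range(iters):
        G = matmul(grad_M(outer(F)), F)
        F = project([[f - 2.0 * step * g for f, g in zip(f_row, g_row)]
                     for f_row, g_row in zip(F, G)])
    return F

# Toy cost L_n(M) = 0.5 * |||M - Y|||_F^2 with a rank-one PSD target Y = u u^T.
u = [1.0, 2.0, 2.0]
Y = [[a * b for b in u] for a in u]

def grad_M(M):
    return [[m - y for m, y in zip(m_row, y_row)] for m_row, y_row in zip(M, Y)]

def identity(F):
    return F  # corresponds to taking the constraint set to be all of R^{d x r}

F0 = [[1.2], [1.8], [2.1]]  # initialization close to u
F_hat = projected_gradient_descent(grad_M, identity, F0, step=0.02, iters=500)
M_hat = outer(F_hat)
err = max(abs(m - y) for m_row, y_row in zip(M_hat, Y) for m, y in zip(m_row, y_row))
```

In this toy problem the initialization is already close to a valid factor, which mirrors the role that a suitable initialization plays in the theory; a constrained problem would replace `identity` with the Euclidean projection onto its constraint set.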

A significant challenge in the analysis is the fact that there are many possible factorizations of
the form $M^* = F^* \otimes F^*$. In order to address this issue, it is convenient to define an equivalence class
of valid solutions as follows:

$$\mathcal{E}(M^*) := \big\{F^* \in \mathbb{R}^{d \times r} \mid F^* \otimes F^* = M^*, \ (F^*_{\cdot i})^\top F^*_{\cdot j} = 0 \ \ \forall i \neq j \big\}. \qquad (4)$$

¹This gradient takes the simpler form $\nabla \tilde{\mathcal{L}}_n(F) = 2 \nabla_M \mathcal{L}_n(F \otimes F) F$ whenever $\nabla_M \mathcal{L}_n(F \otimes F)$ is symmetric, which is
the case in the concrete examples that we treat.

For the applications of interest here, the underlying goal is to obtain a good estimate of any matrix
in the set $\mathcal{E}(M^*)$. In particular, such an estimate implies a good estimate of $M^*$ itself as well as
the column space and singular values of all the members of the class $\mathcal{E}(M^*)$. Accordingly, we define
the pseudometric

$$d(F, F^*) := \min_{\tilde{F}^* \in \mathcal{E}(M^*)} |||F - \tilde{F}^*|||_F. \qquad (5)$$

Note that all matrices $F^* \in \mathcal{E}(M^*)$ have the same singular values, so that we may write the singular
values $\sigma_1(F^*) \ge \cdots \ge \sigma_r(F^*) > 0$ as well as $|||F^*|||_{\mathrm{op}}$ and $|||F^*|||_F$ without any ambiguity. In fact, this
invariance property holds more generally for any function of the sorted singular values and column
space of $F^*$ (e.g., any unitarily invariant norm).

2.2 Illustrative examples 

Let us now consider a few specific models to illustrate the general set-up from the previous section.
We return to demonstrate consequences of our general theory for these (and other) models in
Section 4.

2.2.1 Matrix regression 

We begin with a simple example, namely one in which we make noisy observations of linear projections
of an unknown low-rank matrix $M^* \in \mathcal{S}^{d \times d}$. In particular, suppose that we are given $n$ i.i.d.
observations $\{(y_i, X^i)\}_{i=1}^n$ of the form

$$y_i = \mathrm{trace}(X^i M^*) + \epsilon_i \quad \text{for } i = 1, \ldots, n, \qquad (6)$$

where $\{\epsilon_i\}_{i=1}^n$ is some i.i.d. sequence of zero-mean noise variables. The paper provides various
examples of such matrix regression problems, depending on the particular choice of the regression
matrices $\{X^i\}_{i=1}^n$.

Original estimator: Without considering computational complexity, a reasonable estimate of
$M^*$ would be based on minimizing the least-squares cost

$$\mathcal{L}_n(M) := \frac{1}{2n} \sum_{i=1}^n \big(y_i - \mathrm{trace}(X^i M)\big)^2 \qquad (7)$$

subject to a rank constraint. However, this problem is computationally intractable in general due
to the nonconvexity of the rank function. A standard convex relaxation is based on the nuclear
norm $|||M|||_{\mathrm{nuc}} := \sum_{j=1}^d \sigma_j(M)$, corresponding to the sum of the singular values of the matrix. In
the symmetric PSD case, it is equivalent to the trace of the matrix. Using the nuclear norm as
regularizer leads to the estimator

$$\min_{M \in \mathcal{S}^{d \times d}} \Big\{ \frac{1}{2n} \sum_{i=1}^n \big(y_i - \mathrm{trace}(X^i M)\big)^2 \Big\} \quad \text{such that } M \succeq 0 \text{ and } |||M|||_{\mathrm{nuc}} \le R,$$

where $R > 0$ is a radius to be chosen. This is a special case of our general estimator (1) with $\mathcal{L}_n$
being the least-squares cost and the constraint set $\mathcal{M} = \{M \in \mathcal{S}^{d \times d} \mid |||M|||_{\mathrm{nuc}} \le R\}$.

Population version: Suppose that the noise variables $\{\epsilon_i\}$ are i.i.d. zero-mean with variance $\sigma^2$,
and the regression matrices $\{X^i\}$ are also i.i.d., zero-mean and such that $\mathbb{E}[\mathrm{trace}(X^i M)^2] = |||M|||_F^2$
for any matrix $M$. Under these conditions, an easy calculation yields that the population cost
function is given by $\bar{\mathcal{L}}(M) = \frac{1}{2} |||M - M^*|||_F^2 + \frac{1}{2} \sigma^2$. For this particular case, note that $M^*$ is the
unique minimizer of the population cost.
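The calculation behind the population cost can be spelled out as follows. This is a sketch under one extra assumption that the text leaves implicit, namely that each noise variable $\epsilon_i$ is independent of the regression matrix $X^i$; the shorthand $\Delta := M^* - M$ is introduced here only for the derivation.

```latex
\begin{align*}
y_i - \mathrm{trace}(X^i M)
  &= \mathrm{trace}(X^i \Delta) + \epsilon_i, \qquad \Delta := M^* - M, \\
\mathbb{E}\big[(y_i - \mathrm{trace}(X^i M))^2\big]
  &= \mathbb{E}\big[\mathrm{trace}(X^i \Delta)^2\big]
   + 2\, \mathbb{E}[\epsilon_i]\, \mathbb{E}\big[\mathrm{trace}(X^i \Delta)\big]
   + \mathbb{E}[\epsilon_i^2] \\
  &= |||\Delta|||_F^2 + 0 + \sigma^2, \\
\mathbb{E}[\mathcal{L}_n(M)]
  &= \frac{1}{2n} \sum_{i=1}^n \big( |||M - M^*|||_F^2 + \sigma^2 \big)
   = \frac{1}{2} |||M - M^*|||_F^2 + \frac{1}{2} \sigma^2 .
\end{align*}
```

The first term vanishes only at $M = M^*$, which is why $M^*$ is the unique minimizer in this example.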

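Projected gradient descent: The factorized cost function is given by

$$\tilde{\mathcal{L}}_n(F) = \frac{1}{2n} \sum_{i=1}^n \big(y_i - \langle\langle X^i, F \otimes F \rangle\rangle\big)^2, \qquad (8)$$

and has gradient $\nabla \tilde{\mathcal{L}}_n(F) = -\frac{2}{n} \sum_{i=1}^n \big(y_i - \mathrm{trace}(X^i (F \otimes F))\big) X^i F$, assuming each $X^i$ is symmetric.
Setting $\mathcal{F} = \mathbb{R}^{d \times r}$, the projected gradient descent updates (3) reduce to usual gradient descent, that
is,

$$F^{t+1} = F^t - \eta^t \nabla \tilde{\mathcal{L}}_n(F^t), \quad \text{for } t = 0, 1, \ldots.$$

We return to analyze these updates in Section 4.2.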
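To make the gradient formula concrete, the sketch below implements the factorized least-squares cost (8) and the gradient $-\frac{2}{n} \sum_i (y_i - \langle\langle X^i, F \otimes F \rangle\rangle) X^i F$ for symmetric $X^i$, and compares one gradient coordinate against a central finite difference. The data-generating choices (Gaussian symmetric $X^i$, noiseless responses, a random $F^*$ whose columns are not orthogonalized) and all names are assumptions made for illustration only.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def outer(F):
    """F (x) F = F F^T."""
    return matmul(F, [list(col) for col in zip(*F)])

def inner(A, B):
    """Trace inner product <<A, B>> = trace(A^T B)."""
    return sum(a * b for a_row, b_row in zip(A, B) for a, b in zip(a_row, b_row))

def cost(F, X, y):
    """Factorized least-squares cost (8)."""
    M = outer(F)
    return sum((yi - inner(Xi, M)) ** 2 for Xi, yi in zip(X, y)) / (2.0 * len(y))

def grad(F, X, y):
    """-(2/n) * sum_i (y_i - <<X^i, F F^T>>) X^i F, valid for symmetric X^i."""
    n, M = len(y), outer(F)
    G = [[0.0] * len(F[0]) for _ in F]
    for Xi, yi in zip(X, y):
        resid = yi - inner(Xi, M)
        XF = matmul(Xi, F)
        for j, row in enumerate(XF):
            for k, v in enumerate(row):
                G[j][k] -= 2.0 * resid * v / n
    return G

def random_symmetric(d, rng):
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
    return [[(A[i][j] + A[j][i]) / 2.0 for j in range(d)] for i in range(d)]

rng = random.Random(0)
d, r, n = 4, 2, 30
F_star = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(d)]
X = [random_symmetric(d, rng) for _ in range(n)]
y = [inner(Xi, outer(F_star)) for Xi in X]  # noiseless responses, for simplicity

F = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(d)]
G = grad(F, X, y)

# Central finite-difference approximation of the (1, 0) gradient coordinate.
h = 1e-6
F_plus = [row[:] for row in F]
F_plus[1][0] += h
F_minus = [row[:] for row in F]
F_minus[1][0] -= h
finite_diff = (cost(F_plus, X, y) - cost(F_minus, X, y)) / (2.0 * h)
```

Plugging `grad` into a plain gradient loop gives the unconstrained update displayed above; the finite-difference comparison is only a sanity check on the sign and the $2/n$ scaling.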

2.2.2 Rank-r PCA with row sparsity 

Principal component analysis is a widely used method for dimensionality reduction. For high-dimensional
problems in which $d \gg n$, it is well-known that classical PCA is inconsistent [39].
Moreover, minimax lower bounds show that consistent eigen-estimation is impossible in the absence
of structure in the eigenvectors. Accordingly, a recent line of work has
studied different forms of PCA with structured eigenvectors.

Here we consider one such form of structured PCA, namely a rank $r$ model with row-wise
sparsity. For a given signal-to-noise ratio $\gamma > 0$ and an orthonormal matrix $F^* \in \mathbb{R}^{d \times r}$, consider a
covariance matrix of the form

$$\Sigma = \gamma \underbrace{(F^* \otimes F^*)}_{M^*} + I_d. \qquad (9)$$

By construction, the columns of $F^*$ span the top rank-$r$ eigenspace of $\Sigma$ with the corresponding
maximal eigenvalues $\gamma + 1$. In the row-sparse version of this model [59], this leading eigenspace is
assumed to be supported on $k$ coordinates, that is, the matrix $F^*$ has at most $k$ non-zero rows.
Given $n$ i.i.d. samples $\{x_i\}_{i=1}^n$ from the Gaussian distribution $N(0, \Sigma)$, the goal of sparse PCA is
to estimate the sparse eigenspace spanned by $F^*$.

Original estimator: A natural estimator is based on a semidefinite program, referred to as the
Fantope relaxation in the paper [60], given by

$$\min_{0 \preceq M \preceq I_d} \big\{ -\mathrm{trace}(\hat{\Sigma}_n M) \big\} \quad \text{such that } \mathrm{trace}(M) \le r \text{ and } \|M\|_1 \le R, \qquad (10)$$

where $\hat{\Sigma}_n := \frac{1}{n} \sum_{i=1}^n x_i x_i^\top$ is the empirical covariance matrix, and $R > 0$ is a radius to be chosen. This is a special
case of our general set-up with $\mathcal{L}_n(M) = -\mathrm{trace}(\hat{\Sigma}_n M)$ and

$$\mathcal{M} := \big\{ M \in \mathcal{S}^{d \times d} \mid \mathrm{trace}(M) \le r \text{ and } \|M\|_1 \le R \big\}.$$

Population version: Since $\mathbb{E}[\hat{\Sigma}_n] = \Sigma$, the population cost function is given by

$$\bar{\mathcal{L}}(M) = \mathbb{E}[\mathcal{L}_n(M)] = -\mathrm{trace}(\Sigma M).$$

Thus, by construction, for any radius $R \ge \|F^* \otimes F^*\|_1$, the matrix $M^* = F^* \otimes F^*$ is the unique
minimizer of the population version of the problem (10), subject to the constraint $M \in \mathcal{M}$.

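Projected gradient descent: For a radius $R$ to be chosen, we consider a factorized version of
the SDP

$$\tilde{\mathcal{L}}_n(F) := -\langle\langle \hat{\Sigma}_n, F \otimes F \rangle\rangle, \qquad \mathcal{F} := \big\{ F \in \mathbb{R}^{d \times r} \mid |||F|||_{\mathrm{op}} \le 1, \ |||F|||_{2,1} \le R \big\}, \qquad (11)$$

where we recall that $|||F|||_{2,1} = \sum_{i=1}^d \|F_{i\cdot}\|_2$. This norm is the appropriate choice for selecting
matrices with sparse rows, as assumed in our initial set-up. We return in Section 4.3 to analyze the
projected gradient updates (3) applied to the pair $(\tilde{\mathcal{L}}_n, \mathcal{F})$ in equation (11).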

As a side-comment, this example illustrates that our theory does not depend on local convexity
of the function $\tilde{\mathcal{L}}_n$. In this case, even though the original function $\mathcal{L}_n$ is convex (in fact, linear) in
the matrix $M \in \mathcal{S}^{d \times d}$, observe that the function $\tilde{\mathcal{L}}_n$ from equation (11) is never locally convex
in the low-rank matrix $F \in \mathbb{R}^{d \times r}$; in fact, since $\hat{\Sigma}_n$ is positive semidefinite, it is a globally concave
function.
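The global concavity claim follows from a one-line identity for quadratic forms. The derivation below is a sketch: $q$, $G$ and $\lambda$ are symbols introduced here for illustration, with $q(F) := \mathrm{trace}(F^\top \hat{\Sigma}_n F)$, so that the factorized sparse PCA cost is $-q(F)$.

```latex
\begin{align*}
&\text{For any } F, G \in \mathbb{R}^{d \times r} \text{ and } \lambda \in [0, 1]: \\
&\lambda\, q(F) + (1 - \lambda)\, q(G) - q\big(\lambda F + (1 - \lambda) G\big)
   = \lambda (1 - \lambda)\, q(F - G)
   = \lambda (1 - \lambda)\, \mathrm{trace}\big((F - G)^\top \hat{\Sigma}_n (F - G)\big) \;\ge\; 0, \\
&\text{so that } \;
 -q\big(\lambda F + (1 - \lambda) G\big) \;\ge\; \lambda \big({-q(F)}\big) + (1 - \lambda) \big({-q(G)}\big).
\end{align*}
```

The inequality uses only that $\hat{\Sigma}_n \succeq 0$, so it holds for every pair $F, G$ and not merely locally.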


2.2.3 Low-rank and sparse matrix decomposition 

There are various applications in which it is natural to model an unknown matrix as the sum of two
matrices, one of which is low-rank and the other of which is sparse. Concretely, suppose that we
make observations of the form $Y = M^* + S^* + E$ where $M^*$ is low-rank, the matrix $S^*$ is symmetric
and elementwise-sparse, and $E$ is a symmetric matrix of noise variables. Many problems can be
cast in this form, including robust forms of PCA, factor analysis, and Gaussian graphical model
estimation; see, for example, [19] and references therein for further details on these and
other applications.

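Original estimator: Letting $S_j \in \mathbb{R}^d$ denote the $j$-th column of a matrix $S \in \mathbb{R}^{d \times d}$, define the
set of matrices $\mathbb{S} := \{S \in \mathbb{R}^{d \times d} \mid \|S_j\|_1 \le R_j \text{ for } j = 1, 2, \ldots, d\}$, where $(R_1, \ldots, R_d)$ are user-defined
radii. Using the nuclear norm and $\ell_1$ norm as surrogates for rank and sparsity respectively,
a popular convex relaxation approach is based on the SDP

$$\min_{M \in \mathcal{S}^{d \times d}} \Big\{ \frac{1}{2} \min_{S \in \mathbb{S}} |||Y - (M + S)|||_F^2 \Big\} \quad \text{subject to } M \succeq 0 \text{ and } |||M|||_{\mathrm{nuc}} \le R.$$

This is a special case of our general estimator with $\mathcal{L}_n(M) := \frac{1}{2} \min_{S \in \mathbb{S}} |||Y - (M + S)|||_F^2$, and the
constraint set $\mathcal{M} := \{M \in \mathcal{S}^{d \times d} \mid |||M|||_{\mathrm{nuc}} \le R\}$.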


Population version: In this case, the population function is given by

$$\bar{\mathcal{L}}(M) := \mathbb{E}\Big[ \frac{1}{2} \min_{S \in \mathbb{S}} |||Y - (M + S)|||_F^2 \Big],$$

where the expectation is over the random noise matrix $E$. In general, we are not guaranteed that
$M^*$ is the unique minimizer of this objective, but our analysis shows that (under suitable conditions)
it is a near-minimizer, and this is adequate for our theory.

Projected gradient descent: In this paper, we analyze a version of gradient descent that operates
on the pair $(\tilde{\mathcal{L}}_n, \mathcal{F})$ given by

$$\tilde{\mathcal{L}}_n(F) = \frac{1}{2} \min_{S \in \mathbb{S}} |||Y - (F \otimes F + S)|||_F^2 \quad \text{and} \quad \mathcal{F} := \big\{ F \in \mathbb{R}^{d \times r} \mid [\ldots] \big\}. \qquad (12)$$

Here $F^0$ is the initialization of the algorithm, and the parameter $\mu > 0$ controls the matrix incoherence.
See Sections 4.1 and 4.6 for discussion of matrix incoherence parameters, and their necessity
in such problems. The gradient of $\tilde{\mathcal{L}}_n$ takes the form

$$\nabla \tilde{\mathcal{L}}_n(F) = 2 \big\{ \Pi_{\mathbb{S}}\big(Y - (F \otimes F)\big) - \big(Y - (F \otimes F)\big) \big\} F,$$

where $\Pi_{\mathbb{S}}$ denotes projection onto the constraint set $\mathbb{S}$. This projection is easy to carry out, as it
simply involves a soft-thresholding of the columns of the matrix. Likewise, the projection onto the
set $\mathcal{F}$ from equation (12) is easy to carry out. We return to analyze these projected gradient updates
in Section 4.6.
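The column-wise soft-thresholding mentioned above can be sketched as follows. This is one standard sort-based routine for Euclidean projection onto an $\ell_1$ ball, applied independently to each column; it is offered as an illustration and is not taken from the paper. The function names and the example matrix are hypothetical.

```python
def project_l1_ball(v, radius):
    """Euclidean projection of the vector v onto {x : ||x||_1 <= radius}."""
    if radius <= 0:
        return [0.0] * len(v)
    if sum(abs(x) for x in v) <= radius:
        return list(v)  # already feasible
    # Find the threshold theta with sum_i max(|v_i| - theta, 0) == radius.
    u = sorted((abs(x) for x in v), reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - radius) / k
        if u_k > candidate:
            theta = candidate
    # Soft-threshold every entry by theta, keeping its sign.
    return [max(abs(x) - theta, 0.0) * (1.0 if x > 0 else -1.0) for x in v]

def project_columns(S, radii):
    """Project column j of S onto the l1 ball of radius radii[j]."""
    cols = [project_l1_ball(list(col), R) for col, R in zip(zip(*S), radii)]
    return [list(row) for row in zip(*cols)]

S = [[3.0, 0.2],
     [-1.0, 0.1],
     [0.5, -0.3]]
P = project_columns(S, [2.0, 1.0])
first_column = [row[0] for row in P]   # thresholded at level 1
second_column = [row[1] for row in P]  # already inside its ball, unchanged
```

Each column is handled independently because the constraint set is a product of per-column $\ell_1$ balls, so the cost is dominated by one sort per column.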


In addition to the three examples introduced so far, our theory also applies to various other
low-rank estimation problems, including that of matrix completion with real-valued observations
(Section 4.1) and binary observations (Section 4.5), as well as planted clustering problems
(Section 4.4).


3 Main results 


In this section, we turn to the set-up and statement of our main results on the convergence properties
of projected gradient descent for low-rank factorizations. We begin in Section 3.1 by stating the
conditions on the function $\tilde{\mathcal{L}}_n$ and $\mathcal{F}$ that underlie our analysis. In Section 3.2 we state a result
(Theorem 1) that guarantees sublinear convergence, whereas Section 3.3 is devoted to a result
(Theorem 2) that guarantees faster linear convergence under slightly stronger assumptions. In
Section 4 to follow, we derive various corollaries of these theorems for different concrete versions of
low-rank estimation.

Given a radius $\rho > 0$, we define the ball $\mathbb{B}_2(\rho; F^*) := \{F \in \mathbb{R}^{d \times r} \mid d(F, F^*) \le \rho\}$. At a
high level, our goal is to provide conditions under which the projected gradient sequence $\{F^t\}_{t=0}^\infty$
converges to some multiple of the ball $\mathbb{B}_2(\epsilon_n; F^*)$, where $\epsilon_n \ge 0$ is a statistical tolerance.


3.1 Conditions on the pair $(\tilde{\mathcal{L}}_n, \mathcal{F})$

Recall the definition of the set $\mathcal{E}(M^*)$ of equivalent orthogonal factorizations of a given matrix $M^*$.
We begin with a condition on $\mathcal{F}$ that guarantees that it respects the structure of this set.

$M^*$-faithfulness of $\mathcal{F}$: For a radius $\rho$, the constraint set $\mathcal{F}$ is said to be $M^*$-faithful if for each
matrix $F \in \mathcal{F} \cap \mathbb{B}_2(\rho; F^*)$, we are guaranteed that

$$\arg\min_{A \in \mathcal{E}(M^*)} |||A - F|||_F \subseteq \mathcal{F}. \qquad (13)$$

Of course, this condition is implied by the inclusion $\mathcal{E}(M^*) \subseteq \mathcal{F}$. The $M^*$-faithfulness condition is
natural for our setting, as our goal is to estimate the eigenstructure of $M^*$, and the set $\mathcal{F}$ should
therefore represent prior knowledge of this structure and be independent of a specific factorization
of $M^*$.


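Local descent condition: Our next condition provides a guarantee on the cost improvement that
can be obtained by taking a gradient step when starting from any matrix $F$ that is "sufficiently" far
away from the set $\mathcal{E}(M^*)$.

Definition 1 (Local descent condition). For a given radius $\rho > 0$, curvature parameter $\alpha > 0$ and
statistical tolerance $\epsilon_n \ge 0$, a cost function $\tilde{\mathcal{L}}_n$ satisfies a local descent condition with parameters
$(\alpha, \beta, \epsilon_n, \rho)$ over $\mathcal{F}$ if for each $F \in \mathcal{F} \cap \mathbb{B}_2(\rho; F^*)$, there is some $F_{\pi^*} \in \arg\min_{A \in \mathcal{E}(M^*)} |||A - F|||_F$ such
that

$$\langle\langle \nabla \tilde{\mathcal{L}}_n(F), \ F - F^* \rangle\rangle \ \ge\ \alpha |||F - F_{\pi^*}|||_F^2 - \frac{\beta^2}{\alpha} |||F_{\pi^*} - F^*|||_F^2 - [\ldots] \qquad \forall F^* \in \mathcal{E}(M^*). \qquad (14)$$

In order to gain intuition for this condition, note that by a first-order Taylor series expansion,
we have $\tilde{\mathcal{L}}_n(F) - \tilde{\mathcal{L}}_n(F_{\pi^*}) \approx \langle\langle \nabla \tilde{\mathcal{L}}_n(F), F - F_{\pi^*} \rangle\rangle$, so that this inner product measures the potential
gains afforded by taking a gradient step. Now consider some matrix $F$ whose distance $|||F - F_{\pi^*}|||_F$
from $\mathcal{E}(M^*)$ is larger than the statistical precision. The lower bound (14) with
$F^* = F_{\pi^*}$ then implies that

$$\tilde{\mathcal{L}}_n(F) - \tilde{\mathcal{L}}_n(F_{\pi^*}) \ \approx\ \langle\langle \nabla \tilde{\mathcal{L}}_n(F), \ F - F_{\pi^*} \rangle\rangle \ \ge\ \frac{\alpha}{2} |||F - F_{\pi^*}|||_F^2,$$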

which guarantees a quadratic descent condition. Note that the condition (14) actually allows for
additional freedom in the choice of $F^*$ so as to accommodate the non-uniqueness of the factorization.

One way in which to establish a bound of the form (14) is by requiring that $\tilde{\mathcal{L}}_n$ be locally
strongly convex, and that the gradient $\nabla \tilde{\mathcal{L}}_n(F_{\pi^*})$ approximately vanishes. In particular, suppose
$\tilde{\mathcal{L}}_n$ is $2\alpha$-strongly convex over the set $\mathcal{F} \cap \mathbb{B}_2(\rho; F^*)$, in the sense that

$$\langle\langle \nabla \tilde{\mathcal{L}}_n(F) - \nabla \tilde{\mathcal{L}}_n(F_{\pi^*}), \ F - F_{\pi^*} \rangle\rangle \ \ge\ 2\alpha |||F - F_{\pi^*}|||_F^2 \quad \text{for all } F \in \mathcal{F} \cap \mathbb{B}_2(\rho; F^*).$$

If we assume that $|||\nabla \tilde{\mathcal{L}}_n(F_{\pi^*})|||_F \le \alpha \epsilon_n$, then some simple algebra yields that the lower bound (14)
holds.

However, it is essential to note that our theory covers several examples in which a lower
bound of the form (14) holds, even though $\tilde{\mathcal{L}}_n$ fails to be locally convex, and/or the gradient
$\nabla \tilde{\mathcal{L}}_n(F_{\pi^*})$ does not approximately vanish.² Examples include the problem of sparse PCA, previously
introduced in Section 2.2.2; in this case, the function $\tilde{\mathcal{L}}_n$ is actually globally concave, but
nonetheless our analysis in Section 4.3 shows that a local descent condition of the form (14) holds.
Similarly, for the planted clustering model studied in Section 4.4, the same form of global concavity
holds. In addition, for the matrix regression problem previously introduced in Section 2.2.1, we
prove in Section 4.2 that the condition (14) holds over a set over which $\tilde{\mathcal{L}}_n$ is nonconvex. The
generality of our condition (14) is essential to accommodate these and other examples.


²We note that the vanishing gradient condition is needed in all existing work on nonconvex gradient descent [17].

Local Lipschitz condition: Our next requirement is a straightforward local Lipschitz property:

Definition 2 (Local Lipschitz). The loss function $\tilde{\mathcal{L}}_n$ is locally Lipschitz in the sense that

$$|||\nabla \tilde{\mathcal{L}}_n(F)|||_F \ \le\ L\, |||F^*|||_{\mathrm{op}} \qquad (15)$$

for all $F \in \mathcal{F} \cap \mathbb{B}_2(\rho; F^*)$.


Local smoothness: Our last condition is not required to establish convergence of projected
gradient descent, but rather to guarantee a faster geometric rate of convergence. It is a condition,
complementary to the local descent condition, that upper bounds the behavior of the gradient map
$\nabla \tilde{\mathcal{L}}_n$.

Definition 3 (Local smoothness). For some curvature and smoothness parameters $\alpha$ and $\beta$, statistical
tolerance $\epsilon_n$ and radius $\rho$, we say that the loss function $\tilde{\mathcal{L}}_n$ satisfies a local smoothness condition
with parameters $(\alpha, \beta, \epsilon_n, \rho)$ over $\mathcal{F}$ if for each $F, F' \in \mathcal{F} \cap \mathbb{B}_2(\rho; F^*)$ and $F^* \in \mathcal{E}(M^*)$,

$$\big| \langle\langle \nabla_M \mathcal{L}_n(F) - \nabla_M \mathcal{L}_n(F'), \ F - F^* \rangle\rangle \big| \ \le\ \big( \beta |||F - F'|||_F + \alpha \epsilon_n \big) |||F - F^*|||_F. \qquad (16)$$

The above conditions are stated in terms of the loss function $\tilde{\mathcal{L}}_n$ for the factor matrix $F$.
Alternatively, one may restate these conditions in terms of the loss function $\mathcal{L}_n$ on the original
space, and we make use of this type of reformulation in parts of our proofs. For instance, see
Section 6 for details.


3.2 Sublinear convergence under Lipschitz condition 


With our basic conditions in place, we are now ready to state our first main result. It guarantees
a sublinear rate of convergence under the $M^*$-faithfulness, local descent, and local Lipschitz
conditions.

More precisely, for some descent and Lipschitz parameters $\alpha \le L$, a statistical tolerance $\epsilon_n \ge 0$,
and a constant $\tau \in (0, 1)$, suppose that $\epsilon_n \le [\ldots]\, \sigma_r(F^*)$, the cost function $\tilde{\mathcal{L}}_n$ satisfies the local
descent and Lipschitz conditions (Definitions 1 and 2) with parameters $\alpha$, $L$, and $\rho = (1 - \tau)\sigma_r(F^*)$,
and the constraint set $\mathcal{F}$ is $M^*$-faithful and convex. Let $\kappa = \kappa(F^*) := \sigma_1(F^*) / \sigma_r(F^*)$ be the condition
number of $F^*$. We then have the following guarantee:


Theorem 1. Under the previously stated conditions, given any initial point $F^0$ belonging to the set
$\mathcal{F} \cap \mathbb{B}_2((1 - \tau)\sigma_r(F^*); F^*)$, the projected gradient iterates with step size $\eta^t = [\ldots]$
satisfy the bound

$$d^2(F^t, F^*) \ \le\ \frac{20 L^2 |||F^*|||_{\mathrm{op}}^2}{t\, \alpha^2} + [\ldots] \quad \text{for all iterations } t = 1, 2, \ldots. \qquad (17)$$

See Section 5.1 for the proof of this claim. 

As a minor remark, we note that the assumed upper bound on $\epsilon_n$ in terms of $\sigma_r(F^*)$ entails no loss of generality:
if it fails to hold, then the initial solution $F^0$ already satisfies an error bound better than what is
guaranteed for subsequent iterates.

Conceptually, Theorem 1 provides a minimal set of conditions for the convergence of projected
gradient descent using the nonconvex factorization $M = F \otimes F$. The first term on the right hand side
of equation (17) corresponds to the optimization error, whereas the second term is the statistical
error. The bound (17) shows that the squared distance between $F^t$ and $F^*$ drops at the rate $O(1/t)$ up to
the statistical limit that is determined by the sample size and the signal-to-noise ratio (SNR) of
the problem. We see concrete instances of this statistical error in the examples to follow.


3.3 Linear convergence under smoothness condition 

Although Theorem 1 does guarantee convergence, the resulting rate is sublinear ($O(1/t)$), and
hence rather slow. In this section, we show that if in addition to the local Lipschitz and descent
conditions, the function $\tilde{\mathcal{L}}_n$ satisfies the local smoothness conditions in Definition 3, then much
faster convergence can be guaranteed.

More precisely, suppose that for some numbers $\alpha$, $\beta$, $L$, $\epsilon_n$ and $\tau$ with $0 < \alpha \le \beta = L$, $0 < \tau < 1$
and $\epsilon_n \le [\ldots]\, \sigma_r(F^*)$, the loss function $\tilde{\mathcal{L}}_n$ satisfies the local descent, Lipschitz and smoothness
conditions in Definitions 1-3 over $\mathcal{F}$ with parameters $\alpha$, $\beta$, $L$, and $\rho = (1 - \tau^2)\sigma_r(F^*)$, and that
the set $\mathcal{F}$ is $M^*$-faithful and convex.


Theorem 2. Under the previously stated conditions, there is a constant $0 < c_\tau < 1$ depending only
on $\tau$ such that given an initial matrix $F^0$ in the set $\mathcal{F} \cap \mathbb{B}_2((1 - \tau)\sigma_r(F^*); F^*)$, the projected gradient
iterates with step size $\eta^t = c_\tau [\ldots]$ satisfy the bound

$$d^2(F^t, F^*) \ \le\ \big( 1 - c_\tau [\ldots] \big)^t\, d^2(F^0, F^*) + 16\, \epsilon_n^2 \quad \text{for all iterations } t = 1, 2, \ldots. \qquad (18)$$

See Section 5.2 for the proof of this claim.

The right-hand side of the bound (18) again consists of an optimization error term and a
statistical error term. The theorem guarantees that the optimization error converges linearly at
the geometric rate $O((1 - c)^t)$ up to a statistical limit. Note that the theorem requires the initial
solution $F^0$ to lie within a ball around $F^*$ with radius $(1 - \tau)\sigma_r(F^*)$, which is slightly smaller than
the radius $\rho = (1 - \tau^2)\sigma_r(F^*)$ for which the local descent, Lipschitz and smoothness conditions
hold. Moreover, the step size and the convergence rate depend on the condition number of $F^*$
as well as the quality of the initialization through $\tau$. We did not make an attempt to optimize
this dependence, but improvement in this direction, including adaptive choices of the step size, is
certainly an interesting problem for future work.
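As a rough illustration of the iteration analyzed in Theorem 2, the following sketch runs projected gradient descent on a factorized objective. It assumes the update $F^{t+1} = \Pi_{\mathcal{F}}(F^t - \eta\,\nabla_M \mathcal{L}_n(F^t \otimes F^t)\,F^t)$ and takes $d(F, F^*)$ to be the smallest Frobenius distance over rotations of $F^*$; both are stated here as assumptions, since neither definition appears in this section. The function names and the toy quadratic loss are hypothetical.

```python
import numpy as np

def projected_gradient_descent(grad_M, project, F0, step_size, num_iters):
    """Iterate F <- project(F - step_size * grad_M(F F^T) @ F)."""
    F = np.array(F0, dtype=float)
    for _ in range(num_iters):
        G = grad_M(F @ F.T)              # gradient of the loss with respect to M = F F^T
        F = project(F - step_size * (G @ F))
    return F

def rotation_distance(F, F_star):
    """min over orthogonal Q of ||F - F_star Q||_F (assumed form of d(F, F*))."""
    U, _, Vt = np.linalg.svd(F_star.T @ F)
    return np.linalg.norm(F - F_star @ (U @ Vt))

# Toy instance (an assumption, not one of the paper's models):
# L(M) = 0.5 * ||M - M*||_F^2 with no constraint set, so the projection is the identity.
rng = np.random.default_rng(0)
F_star, _ = np.linalg.qr(rng.normal(size=(30, 2)))
M_star = F_star @ F_star.T
F0 = F_star + 0.05 * rng.normal(size=F_star.shape)
F_hat = projected_gradient_descent(lambda M: M - M_star, lambda F: F, F0, 0.25, 200)
```

In this noiseless toy problem the statistical error term vanishes, so the distance to $F^*$ keeps shrinking until it reaches machine precision.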


4 Concrete results for specific models 

In this section, we turn to the consequences of our general theory for specific models that arise in
applications. Throughout this section, we focus on geometric convergence guaranteed by Theorem 2
using a constant step size. The main technical challenges are to verify the local descent, local
Lipschitz and local smoothness assumptions that are needed to apply this result. Since Theorem 1
depends on weaker assumptions (it does not need the local smoothness property), it should be
understood also that our analysis can be used to derive corollaries based on Theorem 1 as well.

Note: In all of the analysis to follow, we adopt the shorthand $\sigma_j = \sigma_j(F^*)$ for the singular values
of $F^*$, and $\kappa = \sigma_1 / \sigma_r$ for its condition number.

4.1 Noisy matrix completion

We begin by deriving a corollary for the problem of noisy matrix completion. Since we did not
discuss this model in Section 2.2, let us provide some background here. There are a wide variety of
matrix completion problems (e.g., [43]), and the variant of interest here arises when the unknown
matrix has low rank. More precisely, for an unknown PSD and low-rank matrix $M^* \in \mathcal{S}^{d \times d}$, suppose
that we are given noisy observations of a subset of its entries. In the so-called Bernoulli model, the
random subset of observed entries is chosen uniformly at random; that is, each entry is observed
with some probability $p$, independently of all other entries. We can represent these observations by
a random symmetric matrix $Y \in \mathcal{S}^{d \times d}$ with entries of the form

$$ Y_{ij} = \begin{cases} M^*_{ij} + E_{ij}, & \text{with probability } p, \text{ and} \\ \text{(unobserved)}, & \text{otherwise,} \end{cases} \qquad \text{for each } i \ge j. \tag{19} $$

Here the variables $\{E_{ij}, i \ge j\}$ represent a form of measurement noise.
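The observation model (19) can be simulated along the following lines. This is a minimal sketch under stated assumptions: the noise is taken to be Gaussian with standard deviation `sigma`, unobserved entries are stored as zeros next to a boolean mask, and the function name `sample_observations` is hypothetical.

```python
import numpy as np

def sample_observations(M_star, p, sigma, rng):
    """Draw a symmetric noisy, partially observed copy of M_star (Bernoulli model)."""
    d = M_star.shape[0]
    upper = np.triu(rng.random((d, d)) < p)      # one coin flip per entry with i <= j
    mask = upper | upper.T                       # mirror to obtain a symmetric pattern
    E = np.triu(rng.normal(scale=sigma, size=(d, d)))
    E = E + np.triu(E, 1).T                      # symmetric noise matrix
    Y = np.where(mask, M_star + E, 0.0)          # unobserved entries are stored as 0
    return Y, mask
```

Keeping the mask separate from `Y` avoids confusing an unobserved entry with an observed value that happens to be zero.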

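A standard method for matrix completion is based on solving the semidefinite program

$$ \min_{M \in \mathcal{S}^{d \times d}} \underbrace{\frac{1}{2p} \sum_{(i,j) \in \Omega} (M_{ij} - Y_{ij})^2}_{\mathcal{L}_n(M)} \qquad \text{such that } M \succeq 0 \text{ and } |||M|||_{\mathrm{nuc}} \le R, \tag{20} $$

where $\Omega$ denotes the set of observed entries and $R > 0$ is a radius to be chosen. As noted above,
the PSD constraint and nuclear norm bound are equivalent to the trace constraint $\operatorname{trace}(M) \le R$.
In either case, this is a special case of our general estimator.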


The SDP-based estimator (20) is known to have good performance when the underlying matrix
$M^*$ satisfies certain matrix incoherence conditions. These conditions involve its leverage scores,
defined in the following way. Here we consider a simplified setting where the eigenvalues of $M^*$
are equal. By performing an eigendecomposition, we can write $M^* = U D U^\top$, where $D \in \mathbb{R}^{r \times r}$ is
a diagonal matrix of eigenvalues (a constant multiple of the identity when they are equal), and
take $F^* = U D^{1/2}$. With this notation, the incoherence parameter of $M^* = F^* \otimes F^*$ is given by

$$ \mu := \frac{d \max_{i} \|F^*_{i\cdot}\|_2^2}{r\,|||F^*|||_{\mathrm{op}}^2} = \frac{d\,\|F^*\|_{2,\infty}^2}{r\,|||F^*|||_{\mathrm{op}}^2}. \tag{21} $$


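Since we already enforce low-rankness in the factorized formulation, we can drop the nuclear norm
constraint. The generalized projected gradient descent is specified by letting $\tilde{\mathcal{L}}_n$ and $\mathcal{F}$ be

$$ \tilde{\mathcal{L}}_n(F) := \frac{1}{2p} \sum_{(i,j) \in \Omega} \big( (F \otimes F)_{ij} - Y_{ij} \big)^2 \qquad \text{and} \qquad \mathcal{F} := \big\{ F \in \mathbb{R}^{d \times r} \;\big|\; \|F\|_{2,\infty} \le [\ldots] \big\}. $$

Note that $\mathcal{F}$ is convex, and depends on the initial solution $M^0$. The gradient of $\mathcal{L}_n$ is
$\nabla_M \mathcal{L}_n(M) = \frac{1}{p}\Pi_\Omega(M - Y)$, and the projection $\Pi_{\mathcal{F}}$ is given by the row-wise "clipping" operation

$$ [\Pi_{\mathcal{F}}(\theta)]_{i\cdot} = [\ldots] \qquad \text{for } i = 1, 2, \ldots, d. $$

This projection ensures that the iterates of gradient descent remain incoherent.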
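A row-wise clipping operation of this kind can be sketched as follows. The radius of the $\ell_{2,\infty}$ ball is passed in as a free parameter `radius`, which is an assumption of this sketch (the exact radius used in the definition of $\mathcal{F}$ is not reproduced here), and the function name `clip_rows` is hypothetical.

```python
import numpy as np

def clip_rows(Theta, radius):
    """Euclidean projection onto {F : every row of F has l2 norm <= radius}."""
    norms = np.linalg.norm(Theta, axis=1, keepdims=True)
    # Rows inside the ball get scale 1; longer rows are shrunk back to the radius.
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-300))
    return Theta * scale
```

Each row is handled independently, so the cost is linear in the number of entries of the $d \times r$ matrix.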
Remark 1. Note that $\|F^*_{i\cdot}\|_2 = \|\Pi_{\mathrm{col}(F^*)}(e_i)\|_2$ (where $e_i$ is the $i$-th standard basis vector and $\mathrm{col}(F^*)$ is
the column space of $F^*$), so the values of $\|F^*\|_{2,\infty}$ and $\mu$ depend only on $\mathrm{col}(F^*)$ and are the same
for any $F^*$ in $\mathcal{E}(M^*)$.

With this notation in place, we are now ready to apply Theorem 2 to the noisy matrix completion
problem. As we show below, if the initial matrix $F^0$ satisfies the bound $d(F^0, F^*) \le [\ldots]\,|||F^*|||_{\mathrm{op}}$,
then the set $\mathcal{F}$ is $M^*$-faithful. Moreover, if the expected sample size satisfies
$n = pd^2 \gtrsim \max\{\mu r d \log d, [\ldots]\}$, then with probability at least $1 - 4d^{-[\ldots]}$ the loss function $\tilde{\mathcal{L}}_n$ satisfies the local
descent, Lipschitz and smoothness conditions with parameters




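$$ \rho = [\ldots], \qquad \alpha = [\ldots]\,|||F^*|||_{\mathrm{op}}^2, \qquad L = \beta = c_2\,\mu r\,|||F^*|||_{\mathrm{op}}^2 \qquad \text{and} \qquad \varepsilon_n = 100\,\frac{\sqrt{r}\,|||\Pi_\Omega(E)|||_{\mathrm{op}}}{p\,|||F^*|||_{\mathrm{op}}}. $$

Using this fact, we have the following consequence of Theorem 2, which holds when the sample
size $n$ satisfies the bound above and is large enough to ensure that $\varepsilon_n \le [\ldots]\,|||F^*|||_{\mathrm{op}}$.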


Corollary 1. Under the previously stated conditions, if we are given an initial matrix $F^0$ satisfying
the bound $d(F^0, F^*) \le [\ldots]\,|||F^*|||_{\mathrm{op}}$, then with probability at least $1 - 4d^{-[\ldots]}$, the gradient iterates
with step size $\eta^t = [\ldots]$ satisfy the bound

$$ d^2(F^t, F^*) \le \Big(1 - c_4\,[\ldots]\Big)^t d^2(F^0, F^*) + c_5\,[\ldots]\,\frac{|||\Pi_\Omega(E)|||_{\mathrm{op}}^2}{p^2\,|||F^*|||_{\mathrm{op}}^2}. \tag{22} $$

See Section 6.2 for the proof of this claim.


Even though Corollary 1 is a consequence of our general theory, it leads to results for
exact/approximate recovery in the noiseless/noisy setting that are as good as or better than known
results. In the noiseless setting ($E = 0$), our sample size requirement and contraction factor are
sharper than those in the paper [57] by a polynomial factor in the rank $r$. Turning to the noisy
setting, suppose the noise matrix $E$ has independent sub-Gaussian entries with parameter $\sigma^2$. A
technical result to be proved later (see Lemma 11) guarantees that given a sample size $n = pd^2 \ge c_1 d \log^2 d$,
we have the operator norm bound $|||\Pi_\Omega(E)|||_{\mathrm{op}} \lesssim [\ldots]$ with probability at least $1 - d^{-[\ldots]}$. Together
with the bound (22), we conclude that

$$ \frac{1}{d^2}\,[\ldots] \lesssim [\ldots] \lesssim \frac{\sigma^2 r d}{n}. \tag{23} $$

The scaling is better than the results in past work (e.g., [51, 42, 41]) on noisy matrix completion
by a $\log d$ factor; in fact, it matches the minimax lower bounds established in the papers [42].
Thus, Corollary 1 in fact establishes that the projected gradient descent method yields
minimax-optimal estimates.


Initialization: Suppose the rank-$r$ SVD of the matrix […] is given by $U S V^\top$. We can take
$F^0 = \Pi_{\mathcal{F}}(U S^{1/2})$. Under the previously stated condition on the sample size $n$, the matrix $F^0$
satisfies the requirement in Corollary 1, as shown in, e.g., [40] (combined with the above bound on
$|||\Pi_\Omega(E)|||_{\mathrm{op}}$).

Figure 1. Simulation results for matrix completion. Panel (a): plots of optimization error $d(F^t, F^T)$ and
statistical error $d(F^t, F^*)$ versus the iteration number $t$ using SVD initialization. Panel (b): same
plots using a random initialization. The simulation is performed using $d = 1000$, $r = 10$, $p = 0.1$ and
$\sigma = 0.01 \cdot [\ldots]$. Panel (c): plots of per-entry estimation error $\frac{1}{d} d(\hat{F}, F^*)$ versus […], for different values
of $(d, r)$ using SVD-based initialization. Each point represents the average over 20 random instances.
The simulation is performed using $p = 0.1$ and $\sigma = 0.001$.


Computation: Computing the gradient $\nabla_F \tilde{\mathcal{L}}_n(F) = \frac{2}{p}\Pi_\Omega(F \otimes F - Y)F$ takes time $O(r^2|\Omega|)$.
The projection $\Pi_{\mathcal{F}}(F)$ can be computed in time $O(rd)$.
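To make the cost claim concrete, the sketch below evaluates $\frac{1}{p}\Pi_\Omega(F \otimes F - Y)F$ by touching only the observed entries, without ever forming the dense $d \times d$ matrix $F \otimes F$. The coordinate-list input format (`rows`, `cols`, `y`, listing both $(i,j)$ and $(j,i)$ for each observed pair) and the function name are assumptions of this illustration.

```python
import numpy as np

def completion_gradient(F, rows, cols, y, p):
    """Return (1/p) * Pi_Omega(F F^T - Y) @ F using only the observed entries.

    rows, cols, y list every observed entry of the symmetric matrix Y,
    i.e. both (i, j) and (j, i) appear when i != j.
    """
    # Residuals (F F^T - Y)_{ij} on the observed set, one inner product per entry.
    resid = np.einsum("kr,kr->k", F[rows], F[cols]) - y
    G = np.zeros_like(F)
    # Row i of the result accumulates resid_ij * F_j over the observed j.
    np.add.at(G, rows, resid[:, None] * F[cols])
    return G / p
```

Multiplying the result by 2 gives the gradient with respect to $F$ under the $\frac{1}{2p}$ normalization of the loss.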


Simulations: In order to illustrate the predictions of Corollary 1, we performed a number of
simulations. Since the distance measures $d(F, F^*)$ and $|||F \otimes F - F^* \otimes F^*|||_F$ are difficult to compute,
we instead use the subspace distance

$$ d(F, F^*) \approx |||\sin\angle(F, F^*)|||_F \tag{24} $$

as an approximation. Here $\sin\angle(F, F^*)$ is the vector of sines of the principal angles between the column spaces
of matrices $F$ and $F^*$. For each example and given values of model parameters $d$, $r$, $n$, $\sigma$ etc., we
generate a random instance by sampling the true matrix $F^*$ and the problem data randomly from
the relevant model, and then run our projected gradient descent algorithm with $T = 50$ iterations.
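The subspace distance in equation (24) can be computed from a pair of orthonormal bases, since the cosines of the principal angles are the singular values of the product of the two bases. The sketch below follows that standard recipe; the function name `sin_theta_distance` is hypothetical.

```python
import numpy as np

def sin_theta_distance(F, F_star):
    """l2 norm of the sines of the principal angles between col(F) and col(F_star)."""
    Q1, _ = np.linalg.qr(F)           # orthonormal basis of col(F)
    Q2, _ = np.linalg.qr(F_star)      # orthonormal basis of col(F_star)
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)   # guard against rounding slightly above 1
    return np.sqrt(np.sum(1.0 - cosines**2))
```

The quantity depends only on the two column spaces, so it is unchanged when either factor is multiplied by an invertible $r \times r$ matrix.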

In the matrix completion case, we sampled the true matrix $F^*$ uniformly at random from all
$d \times r$ orthonormal matrices, generated a noise matrix $E$ with i.i.d. $N(0, \sigma^2)$
entries, and chose the observed entries randomly according to the Bernoulli model with probability $p$.
We considered two approaches for obtaining the initial matrix $F^0$: (a) the SVD-based procedure
described in Section 4.1, and (b) random initialization, where $F^0$ is a random $d \times r$ orthonormal
matrix projected onto the associated constraint set $\mathcal{F}$. The step size for projected gradient descent
is fixed at $\eta^t = [\ldots]$. Panels (a) and (b) in Figure 1 show the resulting convergence behavior of the
algorithm, which confirms the geometric convergence (and threshold effect for the statistical error)
that is predicted by our theory.

For these random ensembles, our theory predicts that with high probability the per-entry error
of the output $\hat{F}$ satisfies a bound of the form

$$ \frac{1}{d^2}\,d^2(\hat{F}, F^*) \lesssim \frac{\sigma^2 r d}{n} = \frac{\sigma^2 r}{p d}; $$

cf. equation (23). Therefore, with $p$ and $\sigma$ fixed, the ratio $\frac{1}{d} d(\hat{F}, F^*)$ should be proportional to $\sqrt{r/d}$.

Footnote: The approximation (24) is valid up to a constant of 2 if both $F$ and $F^*$ are orthonormal (cf. Proposition 2.2 in [59]).


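4.2 Matrix regression

Recall the matrix regression model previously introduced in Section 2.2.1. In order to simplify
notation, it is convenient to define a linear mapping $\mathfrak{X}_n : \mathbb{R}^{d \times d} \to \mathbb{R}^n$ via
$[\mathfrak{X}_n(M)]_i := \langle\langle X^i, M \rangle\rangle$ for $i = 1, 2, \ldots, n$. Note that the adjoint operator
$\mathfrak{X}_n^* : \mathbb{R}^n \to \mathbb{R}^{d \times d}$ is given by $\mathfrak{X}_n^*(u) = \sum_{i=1}^n X^i u_i$. With this notation, we have the compact representation

$$ \nabla \tilde{\mathcal{L}}_n(F) = \frac{2}{n}\big( \mathfrak{X}_n^*(\mathfrak{X}_n(F \otimes F) - y) \big) F. $$

Since $\langle\langle X^i, F \otimes F \rangle\rangle = \langle\langle (X^i + (X^i)^\top)/2, F \otimes F \rangle\rangle$, we may assume without loss of generality that the
matrices $\{X^i\}$ are symmetric.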
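The operator, its adjoint and the resulting gradient can be written in a few lines. The sketch below stores the measurement matrices as an array of shape `(n, d, d)` and assumes the normalization $\tilde{\mathcal{L}}_n(F) = \frac{1}{2n}\|\mathfrak{X}_n(F \otimes F) - y\|_2^2$ together with symmetric $X^i$; the function names are hypothetical.

```python
import numpy as np

def apply_operator(X, M):
    """[X_n(M)]_i = <<X^i, M>> for a stack X of shape (n, d, d)."""
    return np.einsum("ijk,jk->i", X, M)

def apply_adjoint(X, u):
    """X_n^*(u) = sum_i u_i X^i."""
    return np.einsum("i,ijk->jk", u, X)

def regression_gradient(X, y, F):
    """Gradient in F of (1/(2n)) * ||X_n(F F^T) - y||^2, assuming symmetric X^i."""
    n = X.shape[0]
    resid = apply_operator(X, F @ F.T) - y
    return (2.0 / n) * apply_adjoint(X, resid) @ F
```

A quick sanity check on any implementation of this kind is the adjoint identity $\langle \mathfrak{X}_n(M), u \rangle = \langle\langle M, \mathfrak{X}_n^*(u) \rangle\rangle$.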

In this case, projected gradient descent can be performed with $\mathcal{F} = \mathbb{R}^{d \times r}$, so that the
$M^*$-faithfulness condition holds trivially. It remains to verify that the cost function $\tilde{\mathcal{L}}_n$
satisfies the local descent, local Lipschitz and local smoothness properties, and these properties
depend on the structure of the operator $\mathfrak{X}_n$. For instance, one way in which to certify the conditions
of Theorem 2 is via a version of the restricted isometry property (RIP) applied to the operator $\mathfrak{X}_n$.

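Definition 4 (Restricted isometry property). The operator $\mathfrak{X}_n : \mathbb{R}^{d \times d} \to \mathbb{R}^n$ is said to satisfy the
restricted isometry property with parameter $\delta_k$ if

$$ (1 - \delta_k)\,|||M|||_F^2 \le \frac{1}{n}\|\mathfrak{X}_n(M)\|_2^2 \le (1 + \delta_k)\,|||M|||_F^2 \qquad \text{for all } d\text{-dimensional matrices with } \operatorname{rank}(M) \le k. $$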


It is well known [54, 50] that RIP holds for various random ensembles. For instance, suppose
that the entries of $X^i$ are i.i.d. zero-mean unit variance random variables, satisfying a sub-Gaussian
tail bound. Examples of such ensembles include the standard Gaussian case ($X^i_{jk} \sim N(0, 1)$) as well
as Rademacher variables ($X^i_{jk} \in \{-1, 1\}$ equiprobably). For such ensembles, it is known that with
high probability, a RIP condition of order $r$ holds with a sample size $n \gtrsim rd$.

The RIP condition provides a straightforward way of verifying the conditions of Theorem 2. More
precisely, as we show in the proof of Corollary 2, if the operator $\mathfrak{X}_n$ satisfies RIP with parameter
$\delta_{4r} \in [0, \frac{1}{12})$, then the loss function $\tilde{\mathcal{L}}_n$ satisfies the local descent, Lipschitz and smoothness conditions
with parameters

$$ \rho = (1 - \sqrt{12\delta_{4r}})\,\sigma_r, \qquad \alpha = [\ldots], \qquad L = \beta = [\ldots] \qquad \text{and} \qquad \varepsilon_n = [\ldots]\,|||\mathfrak{X}_n^*(\epsilon)|||_{\mathrm{op}}\,[\ldots]. $$

Using this fact, we have the following corollary of Theorem 2. We state it assuming that the
operator $\mathfrak{X}_n$ satisfies RIP with parameter $\delta_{4r} \in [0, \frac{1}{12})$, and that the sample size $n$ is large enough to
ensure that […].

Corollary 2. Under the previously stated conditions, there is a universal function $\psi : [0, 1/12] \to (0, 1)$
such that given any initial matrix $F^0$ satisfying the bound $d(F^0, F^*) \le (1 - \sqrt{12\delta_{4r}})\,\sigma_r$, the
projected gradient iterates with step size $\eta^t = [\ldots]$ satisfy the bound

$$ d^2(F^t, F^*) \le \Big(1 - [\ldots]\Big)^t d^2(F^0, F^*) + c_0\,[\ldots]\,|||\mathfrak{X}_n^*(\epsilon)|||_{\mathrm{op}}^2\,[\ldots]. $$

See Section 6.1 for the proof of this claim.


Note that the radius of the region of convergence $\mathbb{B}_2((1 - \sqrt{12\delta_{4r}})\,\sigma_r; F^*)$ can be arbitrarily close
to $\sigma_r$ with $\delta_{4r}$ sufficiently small. Moreover, the function $\tilde{\mathcal{L}}_n$ need not be convex in this region. As
a simple example, consider the scalar case with $d = r = 1$, with noiseless observations ($\epsilon = 0$) of
the target parameter $F^* = 1$. A simple calculation then yields that $\tilde{\mathcal{L}}_n(F) = c(F^2 - 1)^2$ for some
constant $c > 0$, which is nonconvex outside of a ball around $F^*$: its second derivative $4c(3F^2 - 1)$ is negative whenever $|F| < 1/\sqrt{3}$.
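The scalar example is easy to check numerically. The short sketch below evaluates the second derivative of $c(F^2 - 1)^2$, which works out to $4c(3F^2 - 1)$, at a point close to the origin and at the target $F^* = 1$; the function names are hypothetical and $c = 1$ is chosen only for illustration.

```python
def scalar_loss(F, c=1.0):
    """Scalar matrix-regression loss c * (F^2 - 1)^2 with target F* = 1."""
    return c * (F**2 - 1.0) ** 2

def scalar_second_derivative(F, c=1.0):
    """Second derivative of scalar_loss, equal to 4c(3F^2 - 1)."""
    return 4.0 * c * (3.0 * F**2 - 1.0)

# Negative curvature near the origin, positive curvature at the target.
curvature_near_origin = scalar_second_derivative(0.5)
curvature_at_target = scalar_second_derivative(1.0)
```

The sign change occurs at $|F| = 1/\sqrt{3}$, so a starting point such as $F = 0.5$ lies in a region where the loss has negative curvature.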

Specialized to the noiseless setting with $\epsilon = 0$, Corollary 2 is similar to the results for nonconvex
gradient descent in [58, 66]. In the more general noisy setting, our statistical error rate is consistent
with the results in [51]. For a more concrete example, suppose $\kappa = O(1)$, and each $X^i$ and $\epsilon$ have
i.i.d. Gaussian entries with $X^i_{jk} \sim N(0, 1)$ and $\epsilon_i \sim N(0, \sigma^2)$. It can be shown that as long as
$n \gtrsim rd\log d$, RIP holds with $\delta_{4r} \le [\ldots]$ and $|||\mathfrak{X}_n^*(\epsilon)|||_{\mathrm{op}} \lesssim \sigma\sqrt{nd}$. The bound in Corollary 2 therefore
implies a constant contraction factor and that

$$ [\ldots] \lesssim [\ldots]\,\frac{rd}{n}. $$

Initialization: Suppose the rank-$r$ SVD of the matrix […] is given by $U S V^\top$. We can take
$F^0 = U$. Under the above Gaussian example, it can be shown that the condition on the initial solution
is satisfied if $n \gtrsim d r^2 \kappa^{[\ldots]} \log d$ and $\sigma$ is small enough.

The sample size required for this initialization scales quadratically in the rank $r$, as compared
to the linear scaling that is the best possible. This looseness is a consequence of requiring
the initialization error to satisfy a Frobenius norm bound instead of an operator norm one. It can
be avoided by using a more sophisticated initialization procedure, for instance one based on a
few iterations of the singular value projection (SVP) algorithm. In the current setting, since
our primary focus is on understanding low-complexity algorithms via gradient descent, we do not
pursue this direction further.


36 


37 


Computation: Let $T_{\mathrm{mul}}$ be the maximum time to multiply $X^i$ with a vector in $\mathbb{R}^d$. Finding the
initial solution as above requires computing the rank-$r$ SVD of the $d \times d$ matrix $\mathfrak{X}_n^*(y)$, which can
be done in time $O(nrT_{\mathrm{mul}} + dr^2)$; cf. [31]. The gradient $\frac{1}{n}\mathfrak{X}_n^*(\mathfrak{X}_n(M) - y)$ can be computed in
time $O(nrT_{\mathrm{mul}} + dr)$. Therefore, the overall time complexity is $O(nrT_{\mathrm{mul}} + dr^2)$ times the number
of iterations.


4.3 Rank-r PCA with row sparsity 

Recall the problem of sparse PCA previously introduced in Section 2.2.2. In this section, we analyze
the projected gradient updates applied to this problem, in particular with the loss function from
equation (11), and the constraint set

$$ \mathcal{F} := \big\{ F \in \mathbb{R}^{d \times r} \;\big|\; |||F|||_{\mathrm{op}} \le 1, \ \|F\|_{2,1} \le \|F^*\|_{2,1} \big\}. $$

To be clear, this choice of constraint set is somewhat unrealistic, since it assumes knowledge of
the norm $\|F^*\|_{2,1}$. This condition could be removed by analyzing instead a penalized form of the
estimator, but as our main goal is to illustrate the general theory, we remain with the constrained
version here.

We now apply Theorem 2 to this problem. As we show in the proof of Corollary 3, the set $\mathcal{F}$ is
$M^*$-faithful. Moreover, for each $0 < \tau < 1$, suppose that the SNR satisfies $\gamma > [\ldots]$ in the row-sparse
spiked covariance model; then with probability at least $1 - 2d^{-[\ldots]}$, the loss function $\tilde{\mathcal{L}}_n$ satisfies the
local descent, smoothness conditions and the relaxed Lipschitz condition (50) with parameters

$$ \rho = 1 - \tau^2, \qquad \alpha = [\ldots], \qquad L = \beta = 4(\gamma + 1)\sqrt{r} \qquad \text{and} \qquad \varepsilon_n = c_1\sqrt{r}\,\max\Big\{ \sqrt{\frac{k \log d}{n}}, \ \frac{k \log d}{n} \Big\}. \tag{25} $$

Using these facts, we have the following corollary of Theorem 2. We state it assuming that the SNR
obeys the bound $\gamma > [\ldots]$ and the sample size $n$ is large enough to ensure […].

Corollary 3. Under the previously stated conditions, there is a function $\psi : (0, 1) \to (0, 1)$ such
that given any initial matrix $F^0 \in \mathcal{F} \cap \mathbb{B}_2(1 - \tau; F^*)$, with probability at least $1 - 2d^{-[\ldots]}$, the projected
gradient iterates with step size $\eta^t = [\ldots]$ satisfy the bound

$$ d^2(F^t, F^*) \le \Big(1 - \psi(\tau)\,[\ldots]\Big)^t d^2(F^0, F^*) + [\ldots]\,(\gamma + 1)^{[\ldots]}\, r \,\max\Big\{ \frac{k \log d}{n}, \ \frac{k^2 \log^2 d}{n^2} \Big\}. $$

See Section 6.3 for the proof of this corollary.


Remark 2. It is noteworthy that $\tilde{\mathcal{L}}_n(F)$ is in fact globally concave in $F$. In order to see this fact,
consider the scalar case with $d = r = 1$, where $\tilde{\mathcal{L}}_n(F) = -CF^2$ for some $C > 0$.

The error rate $\varepsilon_n$ is in fact minimax optimal (up to a logarithmic factor) with respect to $n$, $d$, $k$
as well as the rank $r$; for instance, see the paper [59], where the upper bound is achieved using
computationally intractable estimators. Similar error rates are obtained in [14, 48, 62] using more
sophisticated algorithms, but under a scaling of the sample size (in particular one that is quadratic
in sparsity; see below) that allows for a good initialization.

Initialization: The above results require an initial solution $F^0$ with $d(F^0, F^*) < 1$. Under the
spiked covariance model and given a sample size $n \gtrsim k^2 \log d$, such a solution can be found by
the diagonal thresholding method [39, 48]. Here the quadratic dependence on $k$ is related to a
computational barrier [6], and thus may not be improvable using a polynomial-time algorithm.

Computation: The algorithm requires projection onto the intersection of the spectral norm and
$\ell_1/\ell_2$ norm balls. In the rank one case ($r = 1$), it reduces to projecting to the intersection of the
vector $\ell_2$ and $\ell_1$ balls, which can be done efficiently. In the general case with $r > 1$, it can be
done by alternating projection. The speed of convergence depends on the eigengap $\gamma$, exhibiting
similarity to the standard power method for finding eigenvectors.


Simulations: We performed experiments under the same general set-up as the matrix completion
(see the discussion surrounding equation (24)). For sparse PCA, we generated random ensembles
of problems by fixing the rank $r = 1$, and choosing a random unit-norm $F^* \in \mathbb{R}^d$ supported on $k$
randomly chosen coordinates. Using this random vector, we formed the spiked covariance matrix $\Sigma$
with top eigenvector $F^*$ and SNR $\gamma$. We considered two approaches for initialization: (a) diagonal
thresholding as described in the papers [48], and (b) choosing $F^0$ to be the perturbed version
[…] defined in terms of $\operatorname{support}(F^*)$, where $E_1$ and $E_2$ are random unit
norm vectors with the appropriate dimensions. The step size is fixed at $\eta^t = [\ldots]$.

Figure 2. Simulation results for sparse PCA. Panel (a): plots of optimization error $d(F^t, F^T)$ and
statistical error $d(F^t, F^*)$ versus the iteration number $t$, using diagonal thresholding initialization.
Panel (b): same plots using perturbation initialization. For both panels (a) and (b), simulations are
performed using $d = 5000$, $r = 1$, $k = 5$, $\gamma = 4$ and $n = 4000$. Panel (c): plot of estimation error
$d(\hat{F}, F^*)$ versus […] for different values of $(k, n)$ using diagonal thresholding initialization. Each point
represents the average over 20 random instances. The simulation is performed using $d = 5000$, $r = 1$
and $\gamma = 4$.

Panels (a) and (b) of Figure 2 show the convergence rates of the optimization and statistical
error using these two different types of initializations. Consistent with our theory, we witness an
initially geometric convergence in terms of statistical error followed by an error floor at the statistical
precision. In panel (c), we study the scaling of the estimation error. Our theory predicts that given
a suitable initialization and sample size $n \gtrsim [\ldots]\log d$, then with high probability the output $\hat{F}$ satisfies

$$ d^2(\hat{F}, F^*) \lesssim \frac{(\gamma + 1)^{[\ldots]}\, r}{[\ldots]} \cdot \frac{k \log d}{n}. $$

Therefore, with the triplet of parameters $(d, r, \gamma)$ fixed, the error $d^2(\hat{F}, F^*)$ should grow
proportionally with the ratio $k/n$, a prediction that is confirmed in Figure 2(c).


4.4 Planted densest subgraph 

The planted densest subgraph problem is a generalization of the planted clique problem; it can be
viewed as a single cluster (or rank one) version of the more general planted partition problem. For
a collection of $d$ vertices, there is an unknown subset of size $k$ which forms a cluster. Based on this
cluster and two probabilities $p > q$, a random symmetric matrix $A \in \{0, 1\}^{d \times d}$, which we think of
as the adjacency matrix of the observed graph, is generated in the following way:

• for each pair of vertices $i, j$ in the cluster, $A_{ij} = 1$ with probability $p$, and zero otherwise.

• for all other pairs of vertices, $A_{ij} = 1$ with probability $q$, and zero otherwise.

Let $F^* \in \{0, 1\}^d$ be the cluster membership vector: i.e., $F^*_j = 1$ if and only if vertex $j$ belongs to
the cluster.
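The generative model above can be simulated as follows. This is a minimal sketch: it assumes an undirected graph without self-loops (the text does not say how the diagonal of $A$ is treated), and the function name `sample_planted_subgraph` is hypothetical.

```python
import numpy as np

def sample_planted_subgraph(d, k, p, q, rng):
    """Sample an adjacency matrix A and the 0-1 membership vector F_star."""
    members = rng.choice(d, size=k, replace=False)
    F_star = np.zeros(d)
    F_star[members] = 1.0
    # Edge probability p inside the cluster and q for every other pair.
    probs = np.where(np.outer(F_star, F_star) > 0, p, q)
    upper = np.triu(rng.random((d, d)) < probs, 1)   # one draw per pair i < j
    A = (upper | upper.T).astype(float)              # symmetric, zero diagonal (assumed)
    return A, F_star
```

With this convention the expected adjacency matrix differs from $q J_d + (p - q) F^* \otimes F^*$ only on the diagonal.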
A previous approach is to recover the cluster matrix $M^* = F^* \otimes F^*$ by solving a particular SDP,
derived as a relaxation of the MLE. Let $S := A - [\ldots]\,J_d$ be a shifted version of the adjacency
matrix, where $J_d$ is the $d \times d$ all-one matrix. Consider the semidefinite program

$$ \min_{M} \ \{ -\langle\langle S, M \rangle\rangle \} \qquad \text{such that } M \succeq 0, \ \sum_{i,j} M_{ij} = [\ldots] \ \text{and} \ M \in [0, 1]^{d \times d}. \tag{26} $$


It is known that with probability at least $1 - d^{-[\ldots]}$, the true cluster matrix $M^*$ is the unique
optimal solution to this program when

$$ (p - q)^2\,[\ldots] \ge c_1 \Big( \frac{\log d}{k} + \frac{d}{k^2} \Big), \tag{27} $$

for some universal constant $c_1 > 0$. When $p = 1 = 2q$, this condition reduces to the well-known
$k \gtrsim \sqrt{d}$ tractability region for the planted clique problem.

Alternatively, we may solve the factorized formulation by projected gradient descent, as
applied to the problem

$$ \tilde{\mathcal{L}}_n(F) = \langle\langle -S, F \otimes F \rangle\rangle, \qquad \mathcal{F} = \Big\{ F \;\Big|\; F \in [0, 1]^d, \ \sum_{i} F_i = k \Big\}. $$

This setting is an $r = 1$ special case of our general framework. In this case $\mathcal{E}(M^*) = \{\pm F^*\}$ contains
only two elements, and $\mathcal{F}$ can be verified to be $M^*$-faithful.

We are now ready to apply our general theory to this problem. As we show in the proof of Corollary 4,
if the model parameters satisfy the condition (27), then with probability at least $1 - d^{-[\ldots]}$, the
loss function $\tilde{\mathcal{L}}_n$ satisfies the local descent and smoothness conditions and the relaxed Lipschitz
condition (40) with parameters

$$ \rho = \frac{2}{5}\sqrt{k}, \qquad \alpha = \frac{1}{20}(p - q)k, \qquad \beta = 12(p - q)k \qquad \text{and} \qquad \varepsilon_n = 0. $$

Using this fact, we have the following corollary of Theorem 2. We state it assuming that the
condition (27) holds.

Corollary 4. Under the previously stated conditions, given an initial vector $F^0 \in \mathcal{F} \cap \mathbb{B}_2([\ldots]\sqrt{k}; F^*)$,
the projected gradient iterates with step size $\eta^t = c_2\,[\ldots]$ satisfy the bound

$$ d^2(F^t, F^*) \le (1 - c_3)^t\, d^2(F^0, F^*). $$

See Section 6.4 for the proof of this claim.

The corollary guarantees exact recovery of $M^*$ when $t \to \infty$. The condition (27) matches the
best existing results; see e.g., [21] and the references therein.

Initialization: Set $F^0$ to be the top left singular vector of $A - qJ_d$ projected onto the set $\mathcal{F}$. Note
that $F^*$ is a left singular vector of the matrix $\mathbb{E}[A - qJ_d]$ corresponding to the only non-zero singular
value $(p - q)k$. Under the condition (27), Proposition […] ensures that $|||(A - qJ_d) - \mathbb{E}[A - qJ_d]|||_{\mathrm{op}} \le
[\ldots]\,(p - q)k$ with probability at least $1 - d^{-[\ldots]}$. On this event, applying Wedin's $\sin\Theta$ theorem [30]
guarantees that $F^0$ satisfies the requirement in Corollary 4.


Figure 3. Simulations for planted densest subgraph. Panel (a): plots of optimization error $d(F^t, F^T)$
and statistical error $d(F^t, F^*)$ versus the iteration number $t$, using SVD-based initialization. The
simulation is performed using $d = 8000$, $k = 2000$, $p = 0.13$ and $q = 0.05$. Panel (b): plot of the
probability of successful exact recovery of $F^*$ versus $pd$, for different values of $(d, p)$ using SVD-based
initialization. We declare exact recovery if $d(\hat{F}, F^*) \le 2 \times 10^{-[\ldots]}$, and each point represents the frequency
of exact recovery over 20 random instances. The simulation is performed with $q = [\ldots]$ and $k = [\ldots]$.


Computation: The set $\mathcal{F}$ is the intersection of a hyperplane and a box in $\mathbb{R}^d$, so the associated
projection $\Pi_{\mathcal{F}}$ can be computed in time $O(d)$ [49]. Computing the gradient $\nabla\tilde{\mathcal{L}}_n(F) = -2SF$ only
requires matrix-vector multiplication with the matrix $S$, which is the sum of a rank-1 matrix and
the (usually sparse) graph adjacency matrix $A$. In contrast, solving the SDP in equation (26) using
ADMM requires multiple full SVDs of dense matrices even when the graph is sparse.
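For illustration, the projection onto $\{F \in [0, 1]^d : \sum_i F_i = k\}$ can be computed by shifting the input by a scalar and clipping to the box, with the shift chosen so that the entries sum to $k$. The sketch below finds the shift by bisection, which is simpler than (and not the same as) the linear-time method cited above; the function name `project_capped_simplex` is hypothetical.

```python
import numpy as np

def project_capped_simplex(v, k, tol=1e-10, max_iter=200):
    """Euclidean projection of v onto {x in [0, 1]^d : sum(x) = k}, for 0 <= k <= d."""
    v = np.asarray(v, dtype=float)
    lo, hi = v.min() - 1.0, v.max()   # the clipped sum equals d at lo and 0 at hi
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        total = np.clip(v - lam, 0.0, 1.0).sum()
        if abs(total - k) <= tol:
            break
        if total > k:
            lo = lam                  # shift too small: too much mass remains
        else:
            hi = lam                  # shift too large: too little mass remains
    return np.clip(v - lam, 0.0, 1.0)
```

The clipped sum is non-increasing in the shift, which is what makes bisection applicable here.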


Simulations: We performed experiments under the same general set-up as the matrix completion
(see the discussion surrounding equation (24)). The 0-1 cluster indicator matrix $F^* \in \{0, 1\}^d$ is
supported on $k$ coordinates, and we sampled the graph adjacency matrix $A$ from the planted densest
subgraph model with edge probabilities $(p, q)$ and cluster size $k$. The initial matrix $F^0$ is obtained
using the SVD-based procedure described in Section 4.4. The step size is fixed at […]. Panel
(a) of Figure 3 shows plots of the optimization and statistical errors versus the iteration number;
consistent with Corollary 4, these iterates converge at least geometrically.

In terms of the scaling of the sample size required for exact recovery, we know that […]
then the fixed point of the algorithm $\hat{F}$ will be equal to $F^*$ with high probability provided that
[…]. In particular, see equation (27). Therefore, with $q = [\ldots]/3$ and $k = [\ldots]/2$, exact recovery of
$F^*$ can be achieved with probability close to one as soon as $pd$ is above a constant threshold. This
theoretical prediction is confirmed in panel (b) of Figure 3.

4.5 One-bit matrix completion 

Let us now turn to an extension of the standard (linear) matrix completion model studied in
Section 4.1. It provides a more challenging problem to analyze, and our general theory provides (to
the best of our knowledge) the first known polynomial-time algorithm for achieving the minimax
rate in the case of rank $r$ matrices.

In order to set up the problem, suppose that $F^* \in \mathbb{R}^{d \times r}$ is an orthonormal matrix and has
incoherence parameter $\mu$ as previously defined in equation (21). Given a set $\Omega \subseteq [d] \times [d]$ of
observed elements, a noise parameter $\sigma > 0$ and a differentiable function $f : \mathbb{R} \to [0, 1]$ with
Lipschitz derivative, we observe a binary symmetric matrix $Y \in \{-1, 1\}^{d \times d}$ such that for each
$(i, j) \in \Omega$ with $i \ge j$,

$$ Y_{ij} = \begin{cases} 1, & \text{with probability } f(M^*_{ij}/\sigma), \\ -1, & \text{with probability } 1 - f(M^*_{ij}/\sigma). \end{cases} $$

We further assume that the observation set $\Omega$ is symmetric and generated by the Bernoulli model
with parameter $p$, that is, $\mathbb{P}((i, j), (j, i) \in \Omega) = p$ independently for each $(i, j)$ with $i \ge j$. The goal
is to estimate $M^*$ given the binary observations $Y$. Examples of the function $f$ include the logistic
model with $f(x) = \frac{e^x}{1 + e^x}$; the probit model with $f(x) = \Phi(x)$, where $\Phi(x)$ is the cumulative
distribution function of a standard Gaussian; and the Laplacian model in which $f$ is the cumulative
distribution function of a Laplacian(0, 1) variable. See the papers [24] for more details on these
choices.
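The three link functions and the one-bit observation model can be sketched as follows. The function names are hypothetical, and storing unobserved entries as zeros alongside a boolean mask is a convention chosen for this illustration rather than something specified in the text.

```python
import math
import numpy as np

def logistic_link(x):
    """Logistic link f(x) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + np.exp(-x))

def probit_link(x):
    """Standard Gaussian CDF, written via the error function."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(x) / math.sqrt(2.0)))

def laplace_link(x):
    """CDF of a Laplacian(0, 1) variable."""
    tail = 0.5 * np.exp(-np.abs(x))
    return np.where(np.asarray(x) < 0, tail, 1.0 - tail)

def sample_one_bit(M_star, link, sigma, p, rng):
    """Symmetric one-bit observations: +1 with probability link(M*_ij / sigma), else -1."""
    d = M_star.shape[0]
    upper = np.triu(rng.random((d, d)) < p)
    mask = upper | upper.T                           # symmetric Bernoulli(p) pattern
    signs = np.where(rng.random((d, d)) < link(M_star / sigma), 1.0, -1.0)
    signs = np.triu(signs) + np.triu(signs, 1).T     # keep one draw per pair i <= j
    return np.where(mask, signs, 0.0), mask          # 0 marks an unobserved entry
```

All three links map 0 to 1/2, so an entry with $M^*_{ij} = 0$ produces a fair coin flip regardless of the choice of $f$.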

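For a given $f$, consider the negative log-likelihood of $M$, given by

$$ \mathcal{L}_n(M) = -2 \sum_{(i,j) \in \Omega} \Big[ \frac{1 + Y_{ij}}{2} \log f(M_{ij}/\sigma) + \frac{1 - Y_{ij}}{2} \log\big(1 - f(M_{ij}/\sigma)\big) \Big] $$
$$ \phantom{\mathcal{L}_n(M)} = -\langle\langle \Pi_\Omega(J_d + Y), \log f(M/\sigma) \rangle\rangle - \langle\langle \Pi_\Omega(J_d - Y), \log\big(1 - f(M/\sigma)\big) \rangle\rangle, \tag{28} $$

where $J_d$ is the $d \times d$ all-one matrix, $\circ$ denotes the Hadamard product, and functions are applied to
a matrix element-wise. As in matrix completion, we use the set $\mathcal{F} = \{F \mid \|F\|_{2,\infty} \le [\ldots]\}$.

Note that the gradient of the loss function is given by

$$ \nabla_M \mathcal{L}_n(M) = -\frac{1}{\sigma}\,\Pi_\Omega\Big[ \frac{f'(M/\sigma) \circ (Y - 2f(M/\sigma) + J_d)}{f(M/\sigma) \circ (1 - f(M/\sigma))} \Big], $$

where the fractions are also element-wise.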
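The loss (28) and its gradient can be checked against each other numerically. The sketch below specializes to the logistic link, for which $f' = f(1 - f)$, and represents $\Pi_\Omega$ by a boolean mask; these choices and the function names are assumptions made for the sake of a short example.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def one_bit_nll(M, Y, mask, sigma):
    """Negative log-likelihood (28) restricted to the observed entries."""
    f = logistic(M / sigma)
    terms = (1.0 + Y) * np.log(f) + (1.0 - Y) * np.log(1.0 - f)
    return -np.sum(terms[mask])

def one_bit_nll_grad(M, Y, mask, sigma):
    """Element-wise gradient of one_bit_nll with respect to M."""
    f = logistic(M / sigma)
    fprime = f * (1.0 - f)                       # derivative of the logistic link
    G = -(fprime * (Y - 2.0 * f + 1.0)) / (sigma * f * (1.0 - f))
    return np.where(mask, G, 0.0)                # Pi_Omega zeroes unobserved entries
```

For the logistic link the expression simplifies to $-(Y - 2f(M/\sigma) + J_d)/\sigma$ on the observed entries, which gives a second way to validate the code.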

Since the function $f$ is differentiable with a Lipschitz derivative $f'$, Rademacher's theorem
guarantees that the second derivative $f''$ is defined almost everywhere. Our corollary depends on the
function $f$ through the following two quantities, defined for each $\alpha > 0$:

$$ L_\alpha := \max\Big\{ \sup_{|x| \le \alpha} \frac{|f'(x)|}{f(x)(1 - f(x))}, \ \sup_{|x| \le \alpha} \frac{f'(x)^2}{f(x)^2 (1 - f(x))^2}, \ \sup_{|x| \le \alpha} \frac{|f''(x)|}{f(x)(1 - f(x))} \Big\} \quad \text{and} \quad \ell_\alpha := \sup_{|x| \le \alpha} \frac{f(x)(1 - f(x))}{f'(x)^2}. \tag{29} $$

These quantities are similar to those in the paper [24], along with the additional control over the
second derivative $f''$ required for proving (fast) geometric convergence. We introduce the shorthand
$\nu := [\ldots]$, which we think of as a measure of SNR. In the constant SNR setting $\nu = O(1)$, the
quantities $L_{[\ldots]}$ and $\ell_{[\ldots]}$ are positive universal constants independent of the other model parameters
$d$, $r$, $p$ etc.

We now apply Theorem 2 to the one-bit matrix completion problem. Set

$$ \rho = c_1 \max\{[\ldots]\}, \qquad \alpha = c_2\,[\ldots], \qquad L = \beta = c_3\,[\ldots], \qquad \varepsilon_n = c_4\,[\ldots]. $$


As we show in the proof of Corollary 5, if the initial matrix satisfies the condition $d(F^0, F^*) \le
1 - \sqrt{1 - \rho}$, then the set $\mathcal{F}$ is $M^*$-faithful. Moreover, if the expected sample size satisfies the bound
$n = pd^2 \ge c_5 \max\{\mu r d \log d, \ d \log^2 d, \ [\ldots]\}$ and is large enough to ensure $\varepsilon_n \le [\ldots]\,(1 - \sqrt{1 - \rho})$,
then with probability at least $1 - c_6 d^{-[\ldots]}$, the loss function $\tilde{\mathcal{L}}_n$ associated with (28) satisfies the local
descent, Lipschitz and smoothness conditions with parameters $\rho$, $\alpha$, $L$, $\beta$ and $\varepsilon_n$ given above. Using
these facts, we obtain the following guarantee, which we state assuming that the sample size $n$
satisfies the above conditions.

Figure 4. Simulation results for one-bit matrix completion. Panel (a): plots of optimization error
$d(F^t, F^T)$ and statistical error $d(F^t, F^*)$ versus the iteration number $t$, using random initialization.
The simulation is performed using $d = 1000$, $r = 3$ and $p = 0.5$. Panel (b): plot of per-entry
estimation error $\frac{1}{d} d(\hat{F}, F^*)$ versus […] for different values of $(d, r)$ using random initialization. Each
point represents the average over 20 random instances. The simulation is performed using $p = 0.5$
and […].

Corollary 5. Under the previously stated conditions, if we are given an initial matrix $F^0$ with
$d(F^0, F^*) \le 1 - \sqrt{1 - \rho}$, then with probability at least $1 - c_6 d^{-[\ldots]}$, the gradient descent iterates
with step size $\eta^t = c_7\,[\ldots]$ satisfy the bound

$$ d^2(F^t, F^*) \le \Big(1 - c_8\,[\ldots]\Big)^t d^2(F^0, F^*) + c_9\,\sigma^2\,[\ldots]\,(1 + \nu)^{[\ldots]}\,\frac{dr}{p}\,[\ldots]. $$

See Section 6.5 for the proof of this claim.

In order to interpret the above result, let us consider the setting with a constant SNR $\nu = O(1)$,
in which case […]. Corollary 5 guarantees that given an initial matrix $F^0$ within
a constant radius of $F^*$, the projected gradient descent converges geometrically and has per-entry
error

$$ [\ldots] \lesssim [\ldots]\,d^2(F^\infty, F^*) \lesssim [\ldots]\,\frac{dr}{n}\,\|F^* \otimes F^*\|_\infty^2. \tag{30} $$

This bound has the same form as that in Section 4.1 for standard matrix completion, with the
important difference that it is an essentially multiplicative bound where the pre-factor depends on
the SNR $\nu$.

It is worth comparing our error bounds with previous results under the setting $\nu = O(1)$. One
body of past work [24, 16] has studied the recovery of approximately low-rank matrices with bounded
nuclear norm, that is, matrices whose vectors of singular values are in the $\ell_q$ ball with $q = 1$. This is
a milder sparsity assumption, and so leads to the slower error rate […]. The result here applies
to exactly low-rank matrices ($q = 0$), and so leads to the faster rate […]. Both of these scalings are
minimax-optimal in the simpler linear setting [51]. On the other hand, Bhaskar et al. also analyze
the case of exactly low-rank matrices, but their algorithm relies on rank-constrained optimization
and does not have convergence guarantees in polynomial time. Moreover, their error rate scales as
[…], and thus has a worse dependence on $r$, $d$ and $n$ as compared to ours.


Initialization and time complexity: In theory, we can obtain a good initial solution by
solving one of the convex programs in the papers [24, 16] followed by a projection onto the set
$\mathcal{F}$. Since we only need the initial error to be a constant, it suffices to have
$n \gtrsim dr + d \log d$ observations. In fact, in our simulations, we find that a randomly chosen
initial matrix is often good enough (see panel (a) of the corresponding figure). Given such an
initial solution, the projected gradient iterates converge geometrically, so that a number of
iterations proportional to $\log(1/\delta)$, up to a problem-dependent pre-factor, suffices to compute
a $\delta$-accurate solution. Therefore, we can achieve this error rate in polynomial time; to the best
of our knowledge, this polynomial-time guarantee for achieving the minimax rate in the exact low-rank
case is the first such result in the literature.
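
The iteration count implied by geometric convergence can be made concrete with a small helper. This is only a sketch: the function name, the generic `contraction` argument (any factor strictly between 0 and 1) and the `initial_error` default are assumptions introduced here for illustration, not quantities taken from the analysis above.

```python
import math


def iterations_for_accuracy(contraction, delta, initial_error=1.0):
    """Smallest t such that initial_error * contraction**t <= delta.

    `contraction` is a hypothetical per-iteration contraction factor in (0, 1).
    """
    if not 0.0 < contraction < 1.0:
        raise ValueError("contraction must lie strictly between 0 and 1")
    if delta <= 0.0 or initial_error <= 0.0:
        raise ValueError("delta and initial_error must be positive")
    if initial_error <= delta:
        return 0
    # Solve contraction**t <= delta / initial_error for t.
    return math.ceil(math.log(initial_error / delta) / math.log(1.0 / contraction))
```

The count grows only logarithmically in $1/\delta$, which is the sense in which geometric convergence yields a polynomial-time guarantee.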


Simulations: We performed experiments under the same general set-up as the matrix completion
simulations (see the discussion surrounding equation (24)). The matrix $F^*$ is random orthonormal,
and the observations are generated using the Bernoulli model with observation probability $p$ and the
standard Gaussian CDF as the link function $f$, with a fixed noise magnitude $\sigma$. The initial matrix
$F^0$ is obtained by random initialization, and the step size $\eta^t$ is held fixed. Panel (a) of the
corresponding figure illustrates the geometric convergence of the algorithm.
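
A data-generation routine along these lines might look as follows. It is a sketch under stated assumptions: the $\pm 1$ coding of the one-bit observations, the scaling of the link as $f(x) = \Phi(x/\sigma)$, the unsymmetrized sampling mask, and the function names are all choices made here for illustration and are not specified in the text above.

```python
import math

import numpy as np


def random_orthonormal(d, r, rng):
    """Return a d x r matrix with orthonormal columns (via a reduced QR)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q


def gaussian_cdf(x):
    """Standard Gaussian CDF, applied entrywise."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))


def one_bit_observations(F_star, p, sigma, rng):
    """Sample one-bit observations of M* = F* F*^T on a Bernoulli(p) mask.

    Assumed model: each observed entry equals +1 with probability
    Phi(M*_ij / sigma) and -1 otherwise; unobserved entries are set to 0.
    """
    M_star = F_star @ F_star.T
    prob_plus = gaussian_cdf(M_star / sigma)
    signs = np.where(rng.random(M_star.shape) < prob_plus, 1.0, -1.0)
    mask = rng.random(M_star.shape) < p
    return signs * mask, mask
```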

In terms of the scaling of the estimation error, with $\sigma$ and $p$ fixed and $n = pd^2$, the per-entry
error of the output $\hat{F}$ obeys, with high probability, the bound implied by equation (30).
Therefore, we should expect the normalized squared error $d^2(\hat{F}, F^*)$ to scale in proportion to
the ratio appearing in that bound, a prediction that is confirmed in panel (b) of the corresponding figure.

4.6 Low-rank and sparse matrix decomposition 


Recall from Section 2.2.3 the problem of noisy matrix decomposition, in which we observe a noisy
sum of the form $Y = F^* \otimes F^* + S^* + E$, where $E$ is a symmetric noise matrix. Our goal is to
estimate $F^*$, and in this section, we analyze a version of this model in which the factor matrix
$F^* \in \mathbb{R}^{d \times r}$ has equal eigenvalues and incoherence parameter $\mu$ as defined in
equation (21), and the perturbing matrix $S^* \in \mathbb{S}^{d \times d}$ is element-wise sparse.

One line of work concerns the setting where the non-zero entries of $S^*$ are randomly located [19],
whereas another line of work focuses on deterministic models [35]. We focus on one version
of the deterministic setting, in which each row/column of the matrix $S^*$ has at most $k$ non-zero
entries, whose locations and values are otherwise arbitrary. To keep the presentation as
simple as possible, we assume here that the values of $\|S^*_{i\cdot}\|_1$, the $\ell_1$ norm of each
row of $S^*$, are known.†

†This is unrealistic and could be relaxed, albeit at the price of more involved analysis of the Lagrangian version
instead of the constrained version.
Using the nuclear norm and $\ell_1$ norms as surrogates for rank and sparsity (respectively), the
constrained version of a popular convex relaxation approach is based on the SDP

$$\min_{M} \Big\{ \frac{1}{2} \min_{S \in \mathcal{S}} |||Y - (M + S)|||_F^2 + \lambda |||M|||_{\mathrm{nuc}} \Big\},$$

where $\mathcal{S} := \{ S \in \mathbb{R}^{d \times d} \mid \|S_{i\cdot}\|_1 \le \|S^*_{i\cdot}\|_1, \; i = 1, 2, \ldots, d \}$.
Alternatively, we may drop the nuclear norm regularizer and solve the factorized formulation by
projected gradient descent, as applied to the problem with loss

$$\mathcal{L}_n(M) = \frac{1}{2} \min_{S \in \mathcal{S}} |||M + S - Y|||_F^2,$$

together with a constraint set $\mathcal{F}$ that bounds the maximum row norm $\|F\|_{2,\infty}$.
Note that $\mathcal{L}_n(M)$ is one half of the squared Euclidean distance between the point $Y - M$
and the closed convex set $\mathcal{S}$. Therefore, the function $\mathcal{L}_n$ is convex and has gradient

$$\nabla_M \mathcal{L}_n(M) = M + \Pi_{\mathcal{S}}(Y - M) - Y.$$
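
The gradient formula above can be evaluated with a row-wise projection onto $\ell_1$ balls. The sketch below uses the standard sort-based $\ell_1$-ball projection; the function names are illustrative, and treating $\mathcal{S}$ as a product of per-row $\ell_1$ balls (with no symmetry constraint) is an assumption made here.

```python
import numpy as np


def project_l1_ball(v, radius):
    """Euclidean projection of a vector onto {x : ||x||_1 <= radius}."""
    if radius <= 0:
        return np.zeros_like(v)
    a = np.abs(v)
    if a.sum() <= radius:
        return v.copy()
    u = np.sort(a)[::-1]                      # magnitudes in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    rho = ks[u - (css - radius) / ks > 0][-1]  # number of entries left non-zero
    theta = (css[rho - 1] - radius) / rho      # soft-threshold level
    return np.sign(v) * np.maximum(a - theta, 0.0)


def project_rows(X, radii):
    """Project row i of X onto the l1 ball of radius radii[i]."""
    return np.vstack([project_l1_ball(row, rad) for row, rad in zip(X, radii)])


def loss(M, Y, radii):
    """Half the squared distance from Y - M to the set S."""
    R = Y - M
    return 0.5 * np.sum((R - project_rows(R, radii)) ** 2)


def grad(M, Y, radii):
    """Gradient M + Pi_S(Y - M) - Y of the loss above."""
    return M + project_rows(Y - M, radii) - Y
```

A finite-difference comparison of `loss` against `grad` is a quick way to confirm the sign conventions in the formula.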


We now derive a guarantee for this problem using our Theorem As we show in the proof 
of Corollary if the initial matrix F^ satisfies d{F^,F*) < g, then the constraint set F is M*- 
faithful. Moreover, if the sparsity of the matrix S* satisfies < ci and the noise matrix satisfies 
III-E'III F < C2|||-F*|||op, then the loss function and the feasible set satisfy the local descent, Lipschitz and 
smoothness conditions with parameters 


P = 




III 2 

III op 5 


L = p = 48\lF*\ll 


and 



Using these facts, we have the following guarantee, which is stated assuming that the matrices $S^*$
and $E$ satisfy the assumptions above.

Corollary 6 . Under the previously stated eonditions, given any initial matrix F^ satisfying the 
bound d{F^,F*) < g, the gradient iterates with step size r/* = C 3 satisfy the bound 


d\F\ F*) < (1 - C4)*d2(F0, F*) + C5 


lll^llll 

||2 • 
II op 


\F* 


(31) 


See Section 6.6 for the proof. 


The condition matches the best existing results for the deterministic setting
of matrix decomposition. As a passing observation, the above results can be applied to matrix
completion with adversarial missing entries, by arbitrarily filling in the missing entries and treating
them as sparse corruption. Corollary 6 then guarantees recovery when the number of missing entries in
each row/column is suitably bounded, with locations that can be arbitrary.


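Initialization: We describe how to get a good initial matrix $F^0$ in the noiseless setting $E = 0$.
Suppose $|||F^*|||_{op} = 1$. Let $\bar{Y}$ be obtained from hard-thresholding $Y$ at a fixed level;
that is, each element $Y_{ij}$ whose magnitude is at most this level is kept unchanged, and each
element whose magnitude exceeds it is replaced by the level multiplied by $\mathrm{sign}(Y_{ij})$.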

Figure 5. Matrix decomposition: plots of optimization error d{F*,F'^) and statistical error 
d{F*,F*) versus the iteration number t, using (a) SVD-based initialization and (b) random ini¬ 
tialization. The simulation is performed using d = 600, r = 5, k = 100 and cr = 0.1 • g. Panel 
(c): plots of the probability of successful exact recovery of F* versus for different values of {d, k) 
using SVD-based initialization. We declare exact recovery if d(F,F*) < 2 x 10“^, and each point 
represents frequency of exact recovery over 20 random instances. The simulation is performed using 
r = 6 and cr = 0. 


We then set $F^0$ to be the $d \times r$ matrix whose columns are the top-$r$ singular vectors of
$\bar{Y}$, projected onto the set $\mathcal{F}$. In the appendix, we prove that under these conditions, we have

(32) 

d 

Therefore, the requirement in Corollary is satisfied if < ci for a universal constant ci that 

is sufficiently small. The condition < ci is sub-optimal by a factor of y/r. 
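
The initialization procedure can be sketched in a few lines. The parameter names `level` (the thresholding level) and `gamma` (the row-norm radius defining the constraint set) are placeholders introduced here; their prescribed values are not reproduced, so this illustrates the procedure rather than the exact constants.

```python
import numpy as np


def spectral_init(Y, r, level, gamma):
    """Threshold Y entrywise, take top-r singular vectors, then clip row norms.

    `level` and `gamma` are hypothetical inputs standing in for the
    thresholding level and the row-norm bound of the constraint set.
    """
    # Entries with |Y_ij| <= level are kept; larger ones become level * sign(Y_ij).
    Y_bar = np.clip(Y, -level, level)
    U, _, _ = np.linalg.svd(Y_bar)
    F0 = U[:, :r]
    # Projection onto {F : every row has l2 norm <= gamma} by row-wise clipping.
    norms = np.linalg.norm(F0, axis=1, keepdims=True)
    return F0 * np.minimum(1.0, gamma / np.maximum(norms, 1e-300))
```

When $Y$ is uncorrupted and `level` is large, the output spans the column space of $F^*$ exactly, which gives a simple sanity check.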

Computation: To compute the gradient $\nabla_M \mathcal{L}_n(M) = M + \Pi_{\mathcal{S}}(Y - M) - Y$, we
need to project each row of $Y - M$ to the $\ell_1$ balls, which can be done efficiently [27]. As
discussed in Section 4.1, the projection $\Pi_{\mathcal{F}}$ can be computed by row-wise clipping.
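
Row-wise clipping admits a very short implementation. The sketch below assumes that $\mathcal{F}$ is the set of matrices whose rows all have $\ell_2$ norm at most some radius `gamma`; the function name is illustrative.

```python
import numpy as np


def clip_rows(F, gamma):
    """Project F onto {F : ||F||_{2,inf} <= gamma} by rescaling long rows."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    # Rows within the ball get scale 1; longer rows are shrunk to norm gamma.
    scale = np.minimum(1.0, gamma / np.maximum(norms, 1e-300))
    return F * scale
```

Because the constraint decouples across rows, this is the exact Euclidean projection and costs only $O(dr)$ operations.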


Simulations: We performed experiments under the same general set-up as the matrix completion 
(see the discussion surrounding equation (24)). The matrix F* is random orthonormal; the sparse 
matrix S* has k x d non-zero entries whose locations are sampled uniformly without replacement 
and whose values are independently and uniformly sampled from the interval [0,10 • g]; the noise 
matrix $E$ has i.i.d. zero-mean entries with standard deviation $\sigma$. The initial matrix $F^0$ is
obtained using the SVD-based procedure described in Section 4.6. The step size is fixed at
$\eta^t = 1$. Panels (a) and (b) of Figure 5 confirm the predicted geometric convergence.

In terms of the estimation error, in the noiseless setting with cr = 0, the output F equals F* 
with high probability provided that ^ 1- Therefore, with r fixed, exact recovery of F* can be 

achieved with probability close to one as soon as ^ is below a constant threshold. Panel (c) of 
Figure 5 confirms this prediction.
5 Proofs of general theorems


This section is devoted to the proofs of our general theorems on projected gradient convergence,
namely Theorem 1 in the Lipschitz case, and Theorem 2 under stronger smoothness conditions.
Throughout these proofs, we make use of the convenient shorthand $G^t := \nabla \tilde{\mathcal{L}}_n(F^t)$
for the gradient of $\tilde{\mathcal{L}}_n$ at step $t$. We also define the difference matrix
$\Delta^t := F^{t+1} - F^t$, as well as the parameters $\psi := |||F^*|||_{op}$ and $\sigma_r := \sigma_r(F^*)$.


5.1 Proof of Theorem 1

Our proof proceeds via induction on the event

$$\mathcal{E}_t := \big\{ d(F^s, F^*) \le \underbrace{(1 - \tau)\sigma_r}_{\rho} \ \text{ for all } s \in \{0, 1, \ldots, t\} \big\}. \tag{33}$$

For the base case $t = 0$, note that $\mathcal{E}_0$ holds by the assumptions of the theorem. Assuming
that $\mathcal{E}_t$ holds, it suffices to show $d(F^{t+1}, F^*) \le \rho$, which then implies that
$\mathcal{E}_{t+1}$ holds.

We require the following auxiliary result: 


Lemma 1. For any matrix $F \in \mathbb{R}^{d \times r}$ such that $d(F, F^*) < \sigma_r(F^*)$, the
optimization problem $\min_{A \in \mathcal{E}(M^*)} |||A - F|||_F$ has a unique optimum $F_{\pi^*}$ such
that (i) the matrix $F^\top F_{\pi^*} \in \mathbb{R}^{r \times r}$ is positive definite; and (ii) the
matrix $(F - F_{\pi^*})^\top F_{\pi^*}$ is symmetric.

See Section 5.1.1 for the proof of this claim. In view of Lemma 1, the matrix
$F^s_{\pi^*} := \arg\min_{A \in \mathcal{E}(M^*)} |||A - F^s|||_F$ is uniquely defined for each time
step $s \in \{0, 1, \ldots, t\}$.


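The projected gradient descent update can be decomposed into the two steps

$$\tilde{F}^{s+1} = F^s - \eta^s \nabla \tilde{\mathcal{L}}_n(F^s) \quad \text{and} \quad F^{s+1} = \Pi_{\mathcal{F}}(\tilde{F}^{s+1}). \tag{34}$$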
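
The two-step update (34) translates directly into a loop. The following is a minimal sketch, assuming that the constraint set is a row-norm ball of radius `gamma`, that the step size is constant, and that the loss is supplied through a gradient oracle `grad_M` in the $d \times d$ variable; these choices and all names are illustrative, not the general setting of the theorems.

```python
import numpy as np


def clip_rows(F, gamma):
    """Projection onto matrices whose rows have l2 norm at most gamma."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    return F * np.minimum(1.0, gamma / np.maximum(norms, 1e-300))


def projected_gradient_descent(grad_M, F0, step, gamma, num_iters):
    """Run the update (34) on the factor F, with M = F F^T."""
    F = F0.copy()
    for _ in range(num_iters):
        G_M = grad_M(F @ F.T)            # gradient with respect to M
        G_F = (G_M + G_M.T) @ F          # chain rule: gradient with respect to F
        F_tilde = F - step * G_F         # gradient step
        F = clip_rows(F_tilde, gamma)    # projection step
    return F
```

For instance, with the quadratic loss $\frac{1}{2}|||M - M^*|||_F^2$ the oracle is `lambda M: M - M_star`, and a start near $F^*$ converges geometrically.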


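For each $s \in \{0, 1, \ldots, t\}$, the local descent condition (14) implies that

$$\langle\langle \nabla \tilde{\mathcal{L}}_n(F^s), \; F^s - F^s_{\pi^*} \rangle\rangle \ge \alpha |||F^s - F^s_{\pi^*}|||_F^2 - \alpha \varepsilon_n^2.$$

On the other hand, from the decomposition (34), we have
$\nabla \tilde{\mathcal{L}}_n(F^s) = \frac{1}{\eta^s}(F^s - \tilde{F}^{s+1})$, and hence the
above inequality implies that

$$\begin{aligned}
\alpha |||F^s - F^s_{\pi^*}|||_F^2 - \alpha \varepsilon_n^2
&\le \frac{1}{\eta^s} \langle\langle F^s - \tilde{F}^{s+1}, \; F^s - F^s_{\pi^*} \rangle\rangle \\
&= \frac{1}{2\eta^s} \Big( |||F^s - F^s_{\pi^*}|||_F^2 + |||F^s - \tilde{F}^{s+1}|||_F^2 - |||\tilde{F}^{s+1} - F^s_{\pi^*}|||_F^2 \Big) \\
&= \frac{1}{2\eta^s} \Big( |||F^s - F^s_{\pi^*}|||_F^2 - |||\tilde{F}^{s+1} - F^s_{\pi^*}|||_F^2 \Big) + \frac{\eta^s}{2} |||\nabla \tilde{\mathcal{L}}_n(F^s)|||_F^2.
\end{aligned}$$

Due to the $M^*$-faithfulness and convexity assumption on $\mathcal{F}$, we are guaranteed that
$F^s_{\pi^*} \in \mathcal{F}$, and hence

$$|||F^{s+1} - F^s_{\pi^*}|||_F = |||\Pi_{\mathcal{F}}(\tilde{F}^{s+1}) - \Pi_{\mathcal{F}}(F^s_{\pi^*})|||_F \le |||\tilde{F}^{s+1} - F^s_{\pi^*}|||_F,$$

since Euclidean projection onto a convex set is non-expansive. Moreover, by the Lipschitz condition
(15) on $\tilde{\mathcal{L}}_n$, we have $|||\nabla \tilde{\mathcal{L}}_n(F^s)|||_F \le L\psi$.
Combining the pieces, we find that

$$\alpha |||F^s - F^s_{\pi^*}|||_F^2 - \alpha \varepsilon_n^2 \le \frac{1}{2\eta^s} \Big[ |||F^s - F^s_{\pi^*}|||_F^2 - |||F^{s+1} - F^s_{\pi^*}|||_F^2 \Big] + \frac{\eta^s}{2} L^2 \psi^2. \tag{35}$$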

Introducing the shorthand 7 := — 1, we then make the step size choice 7 ® = 7(777771 • 


Substituting into our bound (35) and rearranging yields 

a(s + 1 + 7 ) III ^,+1 iii 2 ^ a(s- 1 + 7 ) 111 ^, iii 2 , 

lll-^ -^7r*lllF — ^ lll-^ -^7r*lllF“r 


-I- ae^. 


2 2 -™' ' a(s + l + 7 ) 

Multiplying both sides by s -|- 7 and using the fact that — F^*|||f yields 

a(s + 7)(s + l + 7) |||^^+i j. 5+1|||2 + 7)(■§-1 + 7) 1111:.^ 1:^^1112, (■s + 7) 




a(s -I- 1 -I- 7 ) 


L V' + “(s + 7)en' 


2 a 


Summing the above inequality over 5 = 0,... t yields 
a(i + 7) (i + 1 + 7) III E^t+i III 2 ^ «7^ III m 2 

o lll-^ ^TT* IIIf — o IIk -^7r*lllF 


H--|- oi{t -|- l)(t -|- 7 )^n- (36) 

a 


Now observe that the assumptions r < ^ and a < L imply that 7 > ^ 1- These 

inequalities, combined with the facts that |||F® — F^*|||f < (1 ~ 'r)( 7 r by assumption and il^/n = <7^, 


when applied to the bound (|36|), yield 

|||_p*+i _ p 


7 


'tH-l |||2 ^ _ 

“ (t + 7)(i + l + 7) 


(1 - rfal + 


(t + 1 ) 7/2 


(t+ 7 )(i + 1 + 7 ) 


/I _^ 2 V’^ , 2 (t+ 1 ) 2 


7 ^+ (^ + 1 ) 7/2 , -'i 2_2 I 2 (t-Fl) 2 

-(1 — t ) a,. + - -: - e„. 


(t + 4(t + ^ + 7 ) ' ~' t + 1 + 7 ' 

This bound, together with the assumed bound Sn < 2 ^'^’’ yields 

lllpf+i pi+i|||2 / 7^ + (i + 1)7/2 + (i + l)(i + 7)/2 (, n 2 2 / (1 _'i2_2 2 

IF -F,. |||,< - (t + ^)(t+l + ^) -( 1 -r) <r,<(l-r) a, = p 

whence d(F^~^^, F*) < p, thereby proving the induction hypothesis for f -|- 1. 


(37) 


Moreover, since 7 > 1, the inequality (37) implies that 

|||_p*+i _ p 


t+llIlF ^ 


2 ^ 'y ft _\ 2_2 I 0^2 ^ 201 /^, ^^2 

< -(1 — ) <7^ T 2 e^ < ——^^ -|- 4e^ 


t + 7'^ ^ - (t + l)a2 

thereby establishing the bound 0 stated in the theorem. 
5.1.1 Proof of Lemma 1

We use the shorthand $\sigma_k = \sigma_k(F^*)$ for $k = 1, \ldots, r$. Since
$d(F, F^*) = \min_{A \in \mathcal{E}(M^*)} |||F - A|||_F < \sigma_r$, there must exist a matrix
$F_0^* \in \mathcal{E}(M^*)$ such that

$$|||F - F_0^*|||_{op} \le |||F - F_0^*|||_F < \sigma_r. \tag{38}$$

It follows that the matrix $F$ must have full column rank with all its singular values contained in
the interval $[\sigma_r - d(F, F^*), \; \sigma_1 + d(F, F^*)]$. Let $I_r$ denote the $r$-dimensional
identity matrix, and let the rank-$r$ SVD of $F_0^*$ be $F_0^* = V S R^\top$, where
$V \in \mathbb{R}^{d \times r}$, $S = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ and
$R \in \mathbb{R}^{r \times r}$ is orthonormal. Any unit vector in $\mathbb{R}^r$ has the form $Rw$
for some unit vector $w \in \mathbb{R}^r$, and
$F^\top F_0^* R w = (F_0^* + F - F_0^*)^\top V S w = (RS + (F - F_0^*)^\top V) S w$. We therefore have the bound

$$\|F^\top F_0^* R w\|_2 \ge \sigma_{\min}\big(RS + (F - F_0^*)^\top V\big) \|S w\|_2 \ge \big(\sigma_{\min}(RS) - d(F, F^*)\big)\sigma_r = \big(\sigma_r - d(F, F^*)\big)\sigma_r.$$

It follows that the $r$-dimensional matrix $U := F^\top F_0^*$ satisfies
$\sigma_r(U) \ge (\sigma_r - d(F, F^*))\sigma_r > 0$ and is thus invertible. A similar argument shows
that $\sigma_1(U) \le (\sigma_1 + d(F, F^*))\sigma_1$.

Defining the matrix $F_{\pi^*} := F_0^* U^\top (U U^\top)^{-1/2}$, it is easy to verify
$F_{\pi^*} \in \mathcal{E}(M^*)$. Observe that

$$F^\top F_{\pi^*} = F^\top F_0^* U^\top (U U^\top)^{-1/2} = U U^\top (U U^\top)^{-1/2} = (U U^\top)^{1/2},$$

which is symmetric and positive definite since $U$ has strictly positive singular values. It is then
clear that the matrix $(F - F_{\pi^*})^\top F_{\pi^*}$ is symmetric.
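
The construction of $F_{\pi^*}$ is an instance of the orthogonal Procrustes problem and can be computed from one small SVD. The sketch below assumes a reference factor `F_ref` playing the role of $F_0^*$; the function name is illustrative. If $U = P\,\mathrm{diag}(s)\,Q^\top$, then $U^\top (UU^\top)^{-1/2} = Q P^\top$, which is what the code forms.

```python
import numpy as np


def closest_equivalent_factor(F, F_ref):
    """Return F_ref @ R for the orthonormal R minimizing ||F - F_ref R||_F."""
    U = F.T @ F_ref                  # r x r matrix U = F^T F_0^*
    P, _, Qt = np.linalg.svd(U)      # U = P diag(s) Qt
    rotation = Qt.T @ P.T            # equals U^T (U U^T)^{-1/2} when U is invertible
    return F_ref @ rotation
```

With this choice, `F.T @ F_pi` equals $P\,\mathrm{diag}(s)\,P^\top$, which is symmetric and positive definite whenever $U$ is invertible, matching properties (i) and (ii) of Lemma 1.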

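Any matrix $A \in \mathcal{E}(M^*)$ can be written as $A = F_{\pi^*} H$ for some orthonormal matrix
$H \in \mathbb{R}^{r \times r}$, whence

$$\mathrm{trace}(F^\top A) = \mathrm{trace}\big((U U^\top)^{1/2} H\big) \le |||H|||_{op} \sum_{i=1}^r \sigma_i\big((U U^\top)^{1/2}\big) = \sum_{i=1}^r \sigma_i(U). \tag{39}$$

Noting that

$$|||F - F_{\pi^*}|||_F^2 = |||F|||_F^2 + |||F_{\pi^*}|||_F^2 - 2\,\mathrm{trace}\big((U U^\top)^{1/2}\big) = |||F|||_F^2 + |||A|||_F^2 - 2 \sum_{i=1}^r \sigma_i(U),$$

we thus have the bound

$$|||F - F_{\pi^*}|||_F^2 \le |||F|||_F^2 + |||A|||_F^2 - 2\,\mathrm{trace}(F^\top A) = |||F - A|||_F^2.$$

Since $A$ was arbitrary in $\mathcal{E}(M^*)$, we conclude that $F_{\pi^*}$ is a constrained minimizer
of $|||F - A|||_F$ over $A \in \mathcal{E}(M^*)$.

Finally, we claim that the inequality in (39) is strict if $H \ne I_r$, so that $F_{\pi^*}$ is in fact
the unique minimizer, as claimed. To establish this strictness, suppose that the SVD of
$(U U^\top)^{1/2}$ is given by $(U U^\top)^{1/2} = R' \Sigma R'^\top$, where
$\Sigma = \mathrm{diag}(\sigma_1(U), \ldots, \sigma_r(U))$ and $R'$ is an $r \times r$ orthonormal
matrix. Then $\mathrm{trace}\big((U U^\top)^{1/2} H\big) = \mathrm{trace}(\Sigma R'^\top H R') = \mathrm{trace}(\Sigma H')$,
where $H' := R'^\top H R'$ is also orthonormal. If $H \ne I_r$, then $H' \ne I_r$ and therefore

$$\mathrm{trace}\big((U U^\top)^{1/2} H\big) = \sum_{i=1}^r \sigma_i(U) H'_{ii} < \sum_{i=1}^r \sigma_i(U),$$

where the inequality follows from the facts that $\sigma_i(U) > 0$ for each index $i \in [r]$, and
since $H' \ne I_r$, we must have $H'_{ii} < 1 = \|H'_{i\cdot}\|_2$ for some index $i \in [r]$.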
5.2 Proof of Theorem 2


We will in fact prove a slightly stronger form of the theorem, where the L-Lipschitz condition is 
replaced by the following relaxed condition: 


|((Vm/: 4P®T), F-F'))| <L(|||F*|||f^ + |||F*L,|||F-F'|||p), 


(40) 


valid for all F G Ffl IB2((1 — F*) and F' G F. We use the step size choice = 

F(1-t)8 


c 




where C = 272. 


As in the proof of Theorem 1, we will show that $d(F^t, F^*) \le (1 - \tau)\sigma_r$ for all iterates
$t = 0, 1, 2, \ldots$. We do so via induction on the event $\mathcal{E}_t$ previously defined in
equation (33). As before, the base event $\mathcal{E}_0$ holds by the theorem's conditions. The
induction step is based on assuming that $\mathcal{E}_t$ holds, and then showing that
$\mathcal{E}_{t+1}$ also holds. As before, it suffices to establish the bound
$d(F^{t+1}, F^*) \le (1 - \tau)\sigma_r$. The proof is divided into several steps below.


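Showing that $d(F^{t+1}, F^*) \le \rho = (1 - \tau^2)\sigma_r$: In the first step, we establish a
slightly weaker bound on $d(F^{t+1}, F^*)$.

In view of Lemma 1, the matrix $F^s_{\pi^*} := \arg\min_{A \in \mathcal{E}(M^*)} |||A - F^s|||_F$ is
uniquely defined for each time step $s \in \{0, 1, \ldots, t\}$. Recall that the iterate $F^{t+1}$ is
an optimal solution to the convex optimization problem of projecting $F^t - \eta^t G^t$ onto
$\mathcal{F}$. Consequently, the first-order conditions for optimality imply that

$$\langle\langle G^t + \frac{1}{\eta^t}\Delta^t, \; -F^{t+1} + F \rangle\rangle \ge 0, \quad \forall F \in \mathcal{F}. \tag{41}$$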


Applying this condition with F = F* and using the relaxed Lipschitz condition (50) with the 
assumption L = /3, we obtain 

IIIA*|||^ < -A')) < + V’lllA*|||p) = + V^|||A*|||p). 

With the constant $C = 272$ and the fact that $\max\{\tau, 1/\kappa, \alpha/\beta\} \le 1$, this
inequality implies that $|||\Delta^t|||_F \le \tau(1 - \tau)\sigma_r$ and hence that

$$d(F^{t+1}, F^*) \le |||F^{t+1} - F^t_{\pi^*}|||_F \le |||F^t - F^t_{\pi^*}|||_F + |||\Delta^t|||_F \le (1 - \tau^2)\sigma_r. \tag{42}$$

Note that we have not yet completed the induction step, but the bound (42) is useful below.


Establishing a recursive bound: With $F^{t+1}$ satisfying the bound (42), Lemma 1 guarantees
that the matrix $F^{t+1}_{\pi^*} := \arg\min_{A \in \mathcal{E}(M^*)} |||A - F^{t+1}|||_F$ is
well-defined. Our analysis involves the matrix differences $\Delta^s_{s'} := F^s - F^{s'}_{\pi^*}$,
defined by pairs $s, s' \in \{0, 1, \ldots, t+1\}$. The following inequality is central to our analysis:


2,((a‘. -a;+‘))>pI||a;+!iii|- 


136k».8' 


— 3ae„. 


ar° 


(43) 
We return to prove it shortly; taking it as given for the moment, it follows that


|||A^}|||?<|||A*+ie = |||F*+i-F;|||2 

= |||F* - - F;1\1 - III A* III 2 + 2{{A\ - F^ 

V till 2 III A till 2 , t ( _iiiAt+liii 2 , 272 k®/ 3 ^ 


< III A* IP - 
^ III ^t III F 


v*IIIf 


^t+llIlF 

A 2 

IIA 


ar' 


l^IIIp F 6 Q!£? 


= IIIA^lP - IIIA*lP - «^'^^(l-'^)^ ||| At+i |||2 , 272(l-r)^ ^ 2 6 aV^(l-r)^ ^ 

III ^t III F IP'-III F lll'^t+ilIlF + Q IP'-IIIf “T Ck!°I3‘^ 

< IIIA‘1115 - AP|||A‘+!|i + ^ -2 


— lll^dllF 


(7^6/32' 


Ck6/32 


e„. 


where we used the step size choice if = and the assumption C > 272 in the last two lines. 

Rearranging this inequality yields the recursive bound 


ll|ASllll?<(l + ^)"(lll 


A^lll^ + 


6 a^r® 2 '\ ^ A 

A (^1 - 2C7^6^2 


A^IlP -I- 
^tWlF ^ 


6 a^r® 

Ck®/32' 


(44) 


Completing the induction and proof: Since |||A*|||f < (1 — r)V’ by induction hypothesis and 
£n < by assumption, the inequality (44) above implies 

|||A*+J|||2 < (i__^)(i + _^)(l_r)V < (1 -t)V, 

which completes the induction step. Moreover, by applying the inequality (44) recursively, we hnd 
that 


l|A^||| 2 < ( 1 - 


(l ^^VlllA® 

2 CK 6 / 32 J 


0 2 , 6 a^r® 2 


F w 


a 2 r® Ck^I3‘^ 




e < 1 - 

2Ck6/32. 


j II|Ao|||f + (4en)^ 


thereby completing the proof of the theorem. 


Proof of inequality (43): It remains to prove the intermediate claim (43). With d(F*'''^,F*) < 
p = (1 — r2)^ as established in (42), the local descent condition (14) yields 


((G‘+', Agi)) > a|||A;+!|||| - ^lllFgi - F, 


-pi |||2 


— 


In order to proceed, we need a second technical lemma. Recall that
$\kappa = \sigma_1(F^*)/\sigma_r(F^*)$ is the condition number of $F^*$.

Lemma 2. Under the eonditions of Theorem^ we have 

IOk^ 




II F ^ 


See Section 5.2.1 for the proof of this claim. 
Applying Lemma 2, we find that

«G‘+', A‘+‘)) > olllAmU - 
On the other hand, the smoothness condition yields that 

((0‘ - 0‘+i, a;+‘)) > -/3|||A‘|||p|||a;+‘|||p - aj„|||A'+‘|||, 
Together with Lemma we obtain 

((G* A‘+i)) 

IOk^IALI 


> - 


A If T Oi£f 


t+llllF 


+ 


WK^j3 


= -/3|A*iFiA*+}|F-ae„iA‘+J|F- 


^‘If 


10k3 


a 


(i) 


> - (fll|Ai:lli+ ^IIIA‘1111) - (|ll|A'+!|||| + a4) - 


Wk^P, 


a 

a ,„o „ 0 36k^I3‘^ 


^'IIIf 


-( 


25K®a, 




^^If, 


(45) 


(46) 


iLl |2 


+ as: 


(47) 


2 .. p,....- .p 

where the step (i) follows from the AM-GM inequality. 

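Finally, the $M^*$-faithfulness of $\mathcal{F}$ ensures that $F^{t+1}_{\pi^*} \in \mathcal{F}$, so
that we may apply the bound (41) with $F = F^{t+1}_{\pi^*}$, thereby obtaining

$$\langle\langle G^t + \frac{1}{\eta^t}\Delta^t, \; -\Delta^{t+1}_{t+1} \rangle\rangle \ge 0. \tag{48}$$

Adding together inequalities (45), (47) and (48) yields the claim (43).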


5.2.1 Proof of Lemma 2


By dividing through by $\sigma_r$, we may assume that $\sigma_r = 1$, so $\sigma_1(F^*) = \kappa$,
where we recall that $\kappa$ is the condition number of $F^*$. Define the matrices
$U_s := (F^s)^\top F^t_{\pi^*}$ for $s \in \{t, t+1\}$, and recall that we have shown

$$\max\big\{ |||F^t - F^t_{\pi^*}|||_F, \; |||F^{t+1} - F^t_{\pi^*}|||_F \big\} \le 1 - \tau^2.$$

The same argument as in the proof of Lemma from the previous section show that the singular 
values of Ug are in the interval [r^, , and we have the expression T®* : = F^* Uj {UgUj )“^/^ 

for s € {t,t + 1}. Since Ut = = {UtUj )^/^, we have 

m+i - Uth = < tTi|A*iF. 


By applying a known perturbation bound for matrix square roots ( |29[ Lemma 15]), we find that 

- Ut+iUl,\y 




amXUtUj ) 1/2) + CT„,i„((t7t+il74i) 1/2) 


<i^\\UtUj -Ut+lUXh 

= - t/m)^|F 

2 k? . 

< If- 
Moreover, we have


7 -T 




< 


2k^ 


Putting together the pieces, it follows that 

= \lFUUt+i - UtV{Ut+,Ul,)-F^ + FiM^ [([/t+iC/4i)-V2 _ (f/il7/)-V2 
< |||F‘*LJ||C/t+i - UtU{Ut+iUl,)-^/%, + |||F**L,|||C/T \{Ut+iUl,)-F^ - {Utu;^y 


< K ■ K 


If • -2 + K 


2 k^ 


< 


10 k" 


as claimed. 


6 Proofs of corollaries 

In this section, we prove the corollaries by applying our general theory. 

The general theorems are stated in terms of the loss function $\tilde{\mathcal{L}}_n$ of the factor
variable $F$. Sometimes it is convenient to work with the original loss function $\mathcal{L}_n$ of
the $d \times d$ variable $M$. These two loss functions are related by
$\tilde{\mathcal{L}}_n(F) = \mathcal{L}_n(F \otimes F)$ and
$\nabla_F \tilde{\mathcal{L}}_n(F) = [\nabla_M \mathcal{L}_n(F \otimes F) + (\nabla_M \mathcal{L}_n(F \otimes F))^\top] F$,
and the convergence results can be restated in terms of $\mathcal{L}_n$.
We do so below for the result in Theorem 2.
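
The chain-rule identity relating the two gradients is easy to check numerically. The sketch below uses an arbitrary test loss $\mathcal{L}(M) = \langle\langle C, M \rangle\rangle + \frac{1}{2}|||M - Y|||_F^2$ with a non-symmetric $C$, chosen here only so that the transpose term matters; it is not one of the losses studied in this paper, and it reads $F \otimes F$ as $F F^\top$.

```python
import numpy as np


def loss_M(M, C, Y):
    """Illustrative loss in the d x d variable M."""
    return np.sum(C * M) + 0.5 * np.sum((M - Y) ** 2)


def grad_M(M, C, Y):
    """Gradient of loss_M with respect to M (not symmetric in general)."""
    return C + (M - Y)


def loss_F(F, C, Y):
    """Factorized loss: loss_M evaluated at M = F F^T."""
    return loss_M(F @ F.T, C, Y)


def grad_F(F, C, Y):
    """Gradient of the factorized loss: [G + G^T] F with G = grad_M(F F^T)."""
    G = grad_M(F @ F.T, C, Y)
    return (G + G.T) @ F
```

When $\nabla_M \mathcal{L}_n$ is symmetric, as assumed for the examples later in this section, the formula reduces to $2 \nabla_M \mathcal{L}_n(F \otimes F) F$.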

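The following conditions for $\mathcal{L}_n$ are the counterparts of the corresponding conditions for
$\tilde{\mathcal{L}}_n$.

Definition 5 (Local descent condition for $\mathcal{L}_n$). For some curvature parameter $\alpha$,
statistical tolerance $\varepsilon_n$ and radius $\rho$, we say that the cost function $\mathcal{L}_n$
satisfies a local descent condition with parameters $(\alpha, \varepsilon_n, \rho)$ over $\mathcal{F}$
if for each $F \in \mathcal{F} \cap \mathbb{B}_2(\rho; F^*)$, there exists
$F_{\pi^*} \in \arg\min_{A \in \mathcal{E}(M^*)} |||A - F|||_F$ such that

$$\langle\langle \nabla_M \mathcal{L}_n(F \otimes F), \; F \otimes F - F_{\pi^*} \otimes F_{\pi^*} + (F - F_{\pi^*}) \otimes (F - F_{\pi^*}) \rangle\rangle \ge 2\alpha |||F - F_{\pi^*}|||_F^2 - 2\alpha \varepsilon_n |||F - F_{\pi^*}|||_F. \tag{49}$$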

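Definition 6 (Relaxed local Lipschitz condition for $\mathcal{L}_n$). For some Lipschitz constant $L$
and radius $\rho$, we say that the loss function $\mathcal{L}_n$ satisfies a relaxed local Lipschitz
condition with parameters $(L, \rho)$ over $\mathcal{F}$ if for each
$F \in \mathcal{F} \cap \mathbb{B}_2(\rho; F^*)$ and $F' \in \mathcal{F}$,

$$\big| \langle\langle \nabla_M \mathcal{L}_n(F \otimes F), \; (F - F') \otimes F \rangle\rangle \big| \le \frac{1}{2} L \big( |||F^*|||_{op}^2 + |||F^*|||_{op} |||F - F'|||_F \big). \tag{50}$$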
Of course, this relaxed Lipschitz condition for $\mathcal{L}_n$ is implied by a Lipschitz condition of the form

|||Vm>C„(F®F)F|||p < VF G FnB2(p;F*). (51) 

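Definition 7 (Local smoothness condition for $\mathcal{L}_n$). For some curvature and smoothness
parameters $\alpha$ and $\beta$, statistical tolerance $\varepsilon_n$ and radius $\rho$, we say that
the loss function $\mathcal{L}_n$ satisfies a local smoothness condition with parameters
$(\alpha, \beta, \varepsilon_n, \rho)$ over $\mathcal{F}$ if for each
$F, F', F'' \in \mathcal{F} \cap \mathbb{B}_2(\rho; F^*)$ and any $F^* \in \mathcal{E}(M^*)$,

$$\big| \langle\langle \nabla_M \mathcal{L}_n(F \otimes F) - \nabla_M \mathcal{L}_n(F' \otimes F'), \; F' \otimes (F - F^*) \rangle\rangle \big| \le \frac{1}{2} \big( \beta |||F - F'|||_F + \alpha \varepsilon_n \big) |||F - F^*|||_F, \tag{52a}$$

$$\big| \langle\langle \nabla_M \mathcal{L}_n(F \otimes F), \; (F - F^*) \otimes (F' - F'') \rangle\rangle \big| \le \frac{1}{2} \big( \beta |||F' - F''|||_F + \alpha \varepsilon_n \big) |||F - F^*|||_F. \tag{52b}$$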

Now suppose that for some numbers a, /?, L, and r with 0<a</3 = L, 0<r<l and 
En < the empirical loss C,n satishes the local descent, relaxed Lipschitz and smoothness 

conditions in Dehnitions|5j|^over F with parameters a, j3, L, and p = (1 —r^)iTr(F*), that the set 
F is M*-faithful and convex, and that the matrix VCn{M) is symmetric for any symmetric matrix 
M. As we show in the proof of Theorem the loss function then satishes the corresponding 
conditions with the same parameters. Consequently, we have the following result: 

Theorem 3. Under the previously stated conditions, the conclusion in Theorem 2 holds.

See Section]^ for the proof of this claim. 

In the remainder of this section, we verify the above conditions for each of our examples. It is easy
to see that in these examples the matrix $\nabla \mathcal{L}_n(M)$ is indeed symmetric for any
symmetric $M$, so it remains to verify the conditions in Definitions 5-7 for $\mathcal{L}_n$ and the
$M^*$-faithfulness of $\mathcal{F}$.

Recall that $\sigma_r$ and $\kappa$ are the $r$-th singular value and the condition number of $F^*$,
respectively. Throughout this section, we let $F$ be an arbitrary matrix in
$\mathcal{F} \cap \mathbb{B}_2(\rho; F^*)$ and $M = F \otimes F$, where $\mathcal{F}$ and $\rho$ are
specified for each of our examples. In all these examples $\rho < \sigma_r$, so Lemma 1 guarantees
that we can write $F_{\pi^*} := \arg\min_{A \in \mathcal{E}(M^*)} |||A - F|||_F$ and
$\Delta := F - F_{\pi^*}$, and that $\Delta^\top F_{\pi^*}$ is a symmetric matrix. Let $F^*$ be an
arbitrary matrix in $\mathcal{E}(M^*)$, and recall that
$M^* = F^* \otimes F^* = F_{\pi^*} \otimes F_{\pi^*}$. Denote by $C$, $c$, $c_1$ etc. positive
universal constants, whose values could change from line to line.


6.1 Proof of Corollary 

We begin by proving our claims for the matrix sensing observation model. By dividing through by 
ar, we may assume without loss of generality cr^ = 1, so k = fii. Recall that a = 6 ^ 4 ^, L = f3 = 64k^, 
En = p = I — 12 ^ 4 ^. It is a standard result that RIP implies preservation of 

inner products between low rank matrices, as summarized in the lemma below: 


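Lemma 3. If $\mathfrak{X}_n$ satisfies an RIP condition with constant $\delta_{4r}$, then

$$\Big| \frac{1}{n} \langle \mathfrak{X}_n(A), \mathfrak{X}_n(B) \rangle - \langle\langle A, B \rangle\rangle \Big| \le \delta_{4r} |||A|||_F |||B|||_F$$

for all matrices $A, B \in \mathbb{R}^{d \times d}$ of rank at most $2r$.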
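
The inner-product preservation in Lemma 3 can be observed empirically. The sketch below assumes an i.i.d. standard Gaussian measurement ensemble, one common example of an operator satisfying RIP with high probability; the ensemble, the function name and the sizes are illustrative choices, and the code checks a single pair $(A, B)$ rather than the uniform statement of the lemma.

```python
import numpy as np


def sensing_inner_product_gap(A, B, n, rng):
    """Return |(1/n) <X(A), X(B)> - <<A, B>>| for a Gaussian sensing operator X.

    Row i of `X` is the vectorized measurement matrix X_i, so X(A)_i = <<X_i, A>>.
    """
    d1, d2 = A.shape
    X = rng.standard_normal((n, d1 * d2))
    xa = X @ A.ravel()
    xb = X @ B.ravel()
    return abs(xa @ xb / n - np.sum(A * B))
```

For unit-Frobenius-norm inputs, the gap shrinks at roughly the rate $1/\sqrt{n}$ as the number of measurements grows.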
For completeness, we provide a proof in Appendix B.1.


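Under the matrix sensing observation model, the gradient of $\mathcal{L}_n$ takes the form

$$\nabla \mathcal{L}_n(F \otimes F) = \frac{1}{n} \mathfrak{X}_n^* \mathfrak{X}_n (F \otimes F - F^* \otimes F^*) - \frac{1}{n} \mathfrak{X}_n^*(e). \tag{53}$$

Below we verify the local descent, Lipschitz and smoothness conditions.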


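Local descent: We have the decomposition
$\langle\langle \nabla \mathcal{L}_n(M), \; M - M^* + \Delta \otimes \Delta \rangle\rangle = T_1 + T_2$, where

$$T_1 := \frac{1}{n} \langle\langle \mathfrak{X}_n^* \mathfrak{X}_n (M - M^*), \; M - M^* + \Delta \otimes \Delta \rangle\rangle \quad \text{and} \quad T_2 := -\frac{1}{n} \langle\langle \mathfrak{X}_n^*(e), \; M - M^* + \Delta \otimes \Delta \rangle\rangle.$$

Lemma 3 implies that

$$T_1 = \frac{1}{n} \langle \mathfrak{X}_n(M - M^*), \; \mathfrak{X}_n(M - M^* + \Delta \otimes \Delta) \rangle \ge \langle\langle M - M^*, \; M - M^* + \Delta \otimes \Delta \rangle\rangle - \delta_{4r} |||M - M^*|||_F \, |||M - M^* + \Delta \otimes \Delta|||_F.$$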

Since the matrix A~^Ft^* is symmetric, some algebra shows that 

{{M - M*, M - M* + A® A)) = {{F^* ® A + A ® F^* + A ® A, FV* (g) A + A (g) + 2A (g) A)) 

= 2|||F^* ® A|||2 + 2{{F^* ® A, A (g) A)) + 2|||(F,r*)"^^ + 

> 2|||F,r* ® A|||p(|||F,r* ® A|||p - |||A|||p) 

In addition, we have 


|||M - M*|||p|||M - M* + A (g) A|||f = |||F,,. ®A + A®F„, + A® A|||f|||F,,. (g) A + A (g) F,,. + 2A (g) A|||f 

< (2|||F^* ® A|||p + |||A|||2) (2|||F^* ® A|||p + 2|||A|||2). 

It follows that 

Fi > |||F^. ® A|||p((2 - ® A|||p - (2 + 6,54.)|||A|||2) - 2(^4r|||A|||^ 

> IIIA|||p((2-454r)|||A|||p-(2+ 654 .)(l-m 4r-)|||A|||p) -2,54.111 A|||2 > 12,54.111 A|||2, 


where the second step uses the inequalities |||F^*® A|||f > cJr|||A|||F = |||A|||f and |||A|||f < p = 1 — 1254,.. 
On the other hand, we have 

iTal <|||n-iX);(e)|||op • V^\\M - M* + A® A|||p 

<|||n-iX);(e)|||„p • C^(2|||F„* ® A|||p + 2|||A|||2) < |||n-iX);(e)|||„p • 6^/fK|||A|||p, 


where the last step uses the inequalities |||Fn-* ® A|||f < o'i(Fn-*)|||A|||F = k|||A|||f and |||A|||f < 1. 
Combining this upper bound with our lower bound on Ti, we find that 


((V£„(M), M - AF + A ® A)) > 12^4. 


'TK , 


2^4. 


n X*Je 


n\^/lllop 


|A|| 


thereby establishing the local descent condition (49) for 
Local Lipschitz and smoothness: We have the following variational representation:


|||V/:n(F)|||p = sup {{VCn{F(^F), H®F)). 


Using the form of the gradient S/Cn given in equation (53), we have 


VCniF^F) - VCniF'^F') = - F'^F'Y 


Note moreover that 0 G 182 ( 1 ;-F*), a < L = (3 and < 1. Using these facts, it can be verihed that 
the local Lipschitz and smoothness conditions (51 )"(52b) for are implied by a bound of the form 


\{{n-^rMF®F - F'®F'), H®G))\+\{{n-^Xl{e), H ® G))\ < ^ (/3|||F-F'|||p+ae„) |||i/|||p|||G||Up, 

(54) 

valid for all F, F' G 182 ( 1 ; and for all FT, G G 

Let us prove the bound (|54[). Lemma guarantees that 


\{{n-^Xl?in{F®F - F'®F'), H ® G))\ < (1 + h4.)|||F®F - F'®F'|||p • |||F|||p|||GLp 

< (1 + 6 ,r){l\F\U + ll|i"'ll|op)|||i" - F% • |||F|||p|||GLp 
<84F-F'MHMGll„ 


where the last step follows from the facts that |||F|||op < |||F*|||op + d(F, F*) < 2 k: for all F gI 82 ( 1 ;-F*) 
and Sir < 1- We also have 


K(n-'x;(e), F ® G))| < |||n-iX);(e)Lp • |||F ® G|||_ < |||n-'x;(e)Lp • V^|||F|||p|||G||Up. 


Combining these inequalities and recalling the values of a,(3,en yields the claim (54). 


6.2 Proof of Corollary 

We now turn to the proof of our claims for the matrix completion model. By dividing through
by $|||F^*|||_{op}$ and using the equal eigenvalue assumption, we may assume without loss of generality
$|||F^*|||_{op} = \sigma_r(F^*) = 1$. We first show that $\mathcal{F}$ is $M^*$-faithful. Note that $\mathcal{F}$ is the set of matrices with

each row in the £2 ball of radius 7 := '\/^|||-F°|||op- Because F° G 182(g;F*), we have |||F°|||op > 

I III F* III op, whence 7 > Y^^|||F*|||op. Combined with the dehnition of the incoherence parameter /i, 
we see that any matrix F* G £{M*) satishes the maximum row norm bound ||F*|| 2 ,oo < 7 ) so that 
F* G F as desired. 

For future reference, we make note of a useful property satished by any matrices F G F C 
182 (/ 0 ;F*) and F* G £{M*). As a consequence of the clipping operation 11 j-, the row norms of the 
matrices F and F — F* satisfy the bounds 



(55a) 

(55b) 
where we use the inequality $|||F^0|||_{op} \le |||F^*|||_{op} + d(F^0, F^*)$ and the normalization
assumption $|||F^*|||_{op} = 1$. Inequality (55b) applies in particular to the difference matrix
$\Delta := F - F_{\pi^*}$, where we recall that the matrix
$F_{\pi^*} := \arg\min_{A \in \mathcal{E}(M^*)} |||A - F|||_F$ is uniquely defined thanks to Lemma 1.

It remains to verify that the local descent, Lipschitz and smoothness conditions are satisfied
with high probability. Under the matrix completion observation model, the gradient takes the
form $\nabla \mathcal{L}_n(M) = \frac{1}{p} \Pi_\Omega(M - M^* - E)$, where $\Pi_\Omega$ is the
projection operator. We need two technical lemmas. The first lemma shows that the projection
operator $\Pi_\Omega$ approximately preserves inner products between matrices whose column or row
spaces are equal to the column space of $F^*$.

Lemma 4. There is a universal eonstant c such that for any 0 < e < 1 and p > , uniformly 

for all H,G ^ we have 


\p-^{{Uu{F* ^H), UniG®F*)))-{{F*®H, G ® F*))| < e|||F*|||2 
\p-\{Un (F* ® H), UniF* ® G))) - {{F* F*® G))\ < e\lF*\ll 

with probability at least 1 — 2 d~^. 


See Appendix C.1 for the proof of this claim.


(56a) 

(56b) 
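
The type of statement made in Lemma 4 can be illustrated by simulation. The sketch below assumes an i.i.d. Bernoulli($p$) mask without symmetrization and reads $A \otimes B$ as $A B^\top$; the function name, the normalization of the inputs and the single-pair check (rather than a uniform bound) are choices made here for illustration.

```python
import numpy as np


def sampled_inner_product_gap(F_star, H, G, p, rng):
    """Return |p^{-1} <<Pi(F* (x) H), Pi(G (x) F*)>> - <<F* (x) H, G (x) F*>>|.

    Pi keeps each entry independently with probability p and zeros the rest.
    """
    A = F_star @ H.T                  # F* (x) H
    B = G @ F_star.T                  # G (x) F*
    mask = rng.random(A.shape) < p    # Bernoulli(p) sampling pattern
    sampled = np.sum(mask * A * B) / p
    return abs(sampled - np.sum(A * B))
```

For an incoherent $F^*$ (for example, a random orthonormal matrix) the rescaled sampled inner product concentrates around the full inner product, which is the behaviour the lemma quantifies.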


Our second lemma is useful for controlling the projection of “small” matrices to $\Omega$.


Lemma 5. There is a universal constant c > 0 such that for any e G (0,1) and p > 
then unformly for all matrices Z G G G and H with ||L^|| 2 ,oo < 

p-^\lUn{H ® H)\ll < (1 + e)|||F|||| + e|||F||||, 
p-^\lUn{Z)H\ll<72pr\lUn{Z)\ll, 

P“^ll|nf7(</'® c^)|||| < 72/ir|||G|||| 


en d + 
we have 


logd\ 
d )’ 


(57a) 

(57b) 

(57c) 


with probability at least 1 — 2d 

See Section for the proof of this claim. 


For the remainder of the proof, we condition on the intersection of the events in Lemmas 4
and 5.

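Local descent: We have the decomposition
$\langle\langle \nabla \mathcal{L}_n(M), \; M - M^* + \Delta \otimes \Delta \rangle\rangle = T_1 - T_2$, where

$$T_1 := \frac{1}{p} \langle\langle \Pi_\Omega(M - M^*), \; \Pi_\Omega(M - M^* + \Delta \otimes \Delta) \rangle\rangle \quad \text{and} \quad T_2 := \frac{1}{p} \langle\langle \Pi_\Omega(E), \; M - M^* + \Delta \otimes \Delta \rangle\rangle.$$

Our strategy is to lower bound $T_1$ and upper bound $|T_2|$. Beginning with $T_1$, we have

$$\begin{aligned}
T_1 &= p^{-1} \langle\langle \Pi_\Omega(F_{\pi^*} \otimes \Delta + \Delta \otimes F_{\pi^*} + \Delta \otimes \Delta), \; \Pi_\Omega(F_{\pi^*} \otimes \Delta + \Delta \otimes F_{\pi^*} + 2\Delta \otimes \Delta) \rangle\rangle \\
&\ge p^{-1} \big( |||\Pi_\Omega(F_{\pi^*} \otimes \Delta + \Delta \otimes F_{\pi^*})|||_F - 2|||\Pi_\Omega(\Delta \otimes \Delta)|||_F \big) \big( |||\Pi_\Omega(F_{\pi^*} \otimes \Delta + \Delta \otimes F_{\pi^*})|||_F - |||\Pi_\Omega(\Delta \otimes \Delta)|||_F \big).
\end{aligned}$$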
Recall the lower bound on $p$ assumed in the corollary. By Lemma 4, we find that

p-^\lUn{F^* ® A + A ® F^OIIIf > (1 - e)lll^-* ® A + A ® F^*|||2 > 2(1 - e)||| A|||2, 

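where the last step follows from the inequality

$$|||F_{\pi^*} \otimes \Delta + \Delta \otimes F_{\pi^*}|||_F^2 = 2|||F_{\pi^*} \otimes \Delta|||_F^2 + 2\langle\langle F_{\pi^*} \otimes \Delta, \; \Delta \otimes F_{\pi^*} \rangle\rangle = 2|||F_{\pi^*} \otimes \Delta|||_F^2 + 2|||\Delta^\top F_{\pi^*}|||_F^2 \ge 2|||\Delta|||_F^2,$$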

thanks to the symmetry of the matrix A'^'RV* (cf- Lemma |l|). Since ||A|| 2 ,oo < 111 ^IIIf < f; 

we can use the inequality (57a) from Lemma to get 


p-i|||no(A®A)|||2 < (1 + e)|||A|||^ + e|||A|||2 < -(1 + 46)|||A|||2, (58) 

With the constant e sufficiently small, we get that 2|||nf2(A(8)A)|||p < |||nn(i^* 0 A + A (8) L"*) |||f and 
Ti > |||A|||2(y^2(l - e) - ^VlT^e) (^2(1 - e) - > ^ll|A|||?. (59a) 

On the other hand, we have 

\T 2 \ < ’^|||nn(^)|||op • MM - M* + A®A|||f < ^|||no(L;)|||op • |||A|||f. ( 59 b) 

p p 


Combining inequalities (59a) and (59b) with our original decomposition yields 

II'^IIIf) 


{{VCniM), M-M* + A®A)) > 4iI|A||| 2 - l^|||nn(ii;)|||c 

25 


p 


showing that the local descent (49) for Cn holds with a = ^ and = ^^*^^ |||no(iii)||| 


Local Lipschitz and smoothness: Observe that a < L = f3 and max{p, £„} < 1, ||F —F'|| 2 ,oo < 


6,/^ for all F,F' E F, and 


||V£„(F)F|||f = sup {{VCn{F0F), G0F)). 
GeIR‘ix’',|G|||F<l 


Using these facts, it follows that the Lipschitz and smoothness conditions (51)“(52b) for £„ can be 
verified by showing that 

<^(/3|||F-F'|||F + a£n|||77|||op)|||G|||F, (60) 
valid for all F,F' E O 182 ( 1 ; .^*), and for all matrices H,G £ such that ||iL|| 2 ,oo < 


{{Un{F®F - F'®F'), G ® H)) 


{{UniE), G®H)) 

P 


P 


Let us now verify the bound ( |60[ ). For an arbitrary F* E S{M*), define the matrices A = F' — F, 
Ai = F — F* and A 2 = F' — F*, and observe that 


lIln{F®F - F'®F') 

Vp 


_ |||nQ(F* ®a + a®f* + a®A 2 + Ai 

Vp 

^ |||nn(F* 0 A)|||F |||nn(A0A2)|||F , |||no(Ai ® A)|||f 


Vp 


Vp 


Vp 


Ti 


T 2 


T 3 
Lemma 4 implies that $T_1 \le (1 + \epsilon)|||F^* \otimes \Delta + \Delta \otimes F^*|||_F \le 2(1 + \epsilon)|||\Delta|||_F$, whereas inequality (57c)
from Lemma 5 ensures that $\max\{T_2, T_3\} \le 6\sqrt{2\mu r}\,|||\Delta|||_F$. Combining these bounds yields


|||nn(F8F-W)|||, ^^ 

Vp 

On the other hand, using the inequality (57b) from Lemma we have 

G®H))\ < . \ip-y^Un{F®F - F'®F')i7|||F|||G|||i 

< 6^\lUn{F®F - F'®F')|||f|||G|||f. 


Combining with the earlier inequality (61) yields 

\p-\{Un{F^F - F®F'), G ® H))\ < 168/xr|||F - F'|| 

Finally, observe that 

\p-\{Un{E), G0H))\ < ^|||no(i?)|||op|||G®F|||_ < ^|||no(S)|||op|||G|||F|||i7|||op. 

p p 


(61) 


Combining the last two inequalities establishes the claim (60), thereby completing the proof of 
Corollary [T} 


6.3 Proof of Corollary 

We now prove our claims for the sparse PCA model. Define the sampling noise matrix $W := \Sigma_n - \Sigma$,
corresponding to the deviation between the sample and population covariance matrices. Recall
that $\Sigma = \gamma(F^*\otimes F^*) + I_d$ with $|||F^*|||_{op} = 1$. Under the spiked covariance model, we have
$\nabla\mathcal{L}_n(M) = -\Sigma_n = -(\Sigma + W)$. Let $R$ index the non-zero rows of $F^*$; observe that the choice of $R$
does not depend on the choice of $F^*$ in $\mathcal{E}(M^*)$.

In light of Remark, we have $\mathcal{E}(M^*) \subseteq \mathcal{F}$, which guarantees the $M^*$-faithfulness condition. It
remains to verify the local descent, Lipschitz and smoothness conditions.


Local descent: For a given matrix $F$, let $F_{\pi^*}$ be its projection onto $\mathcal{E}(M^*)$, and define $\Delta = F - F_{\pi^*}$. Since $F_{\pi^*}\otimes F_{\pi^*} = F^*\otimes F^*$, we have

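$$\langle\langle\nabla\mathcal{L}_n(M), M - M^* + \Delta\otimes\Delta\rangle\rangle = \underbrace{-\langle\langle\Sigma, M - M^* + \Delta\otimes\Delta\rangle\rangle}_{T_1}\;\underbrace{-\,\langle\langle W, M - M^* + \Delta\otimes\Delta\rangle\rangle}_{T_2}.$$

The remainder of the proof consists of lower bounding $T_1$ and $T_2$.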

Beginning with Ti, observe that 

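$$\begin{aligned} T_1 &= -\langle\langle \gamma F_{\pi^*}\otimes F_{\pi^*} + I,\; F_{\pi^*}\otimes\Delta + \Delta\otimes F_{\pi^*} + 2\Delta\otimes\Delta\rangle\rangle \\ &= -2\gamma\big(\langle\langle F_{\pi^*}, \Delta\rangle\rangle + |||\Delta^\top F_{\pi^*}|||_F^2\big) - 2\langle\langle F_{\pi^*}, \Delta\rangle\rangle - 2|||\Delta|||_F^2. \end{aligned}$$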
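
The expansion of $T_1$ can be checked numerically. The following is a minimal sketch, assuming the convention $A\otimes B = AB^\top$ and taking the orthonormal factor to be the first $r$ columns of the identity (an illustrative choice; the helper names are hypothetical). It evaluates both sides of the identity for a random $\Delta$.

```python
import random

def outer(A, B):
    """A (x) B := A B^T for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

def inner(A, B):
    """Trace inner product <<A, B>> = sum_ij A_ij * B_ij."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def lincomb(terms):
    """Entrywise sum of c * M over (c, M) pairs of equally sized matrices."""
    rows, cols = len(terms[0][1]), len(terms[0][1][0])
    return [[sum(c * M[i][j] for c, M in terms) for j in range(cols)]
            for i in range(rows)]

def t1_both_sides(d=5, r=2, gamma=3.0, seed=0):
    rng = random.Random(seed)
    # Orthonormal columns: first r columns of the identity (illustrative choice).
    F = [[1.0 if i == j else 0.0 for j in range(r)] for i in range(d)]
    Delta = [[rng.uniform(-1, 1) for _ in range(r)] for _ in range(d)]
    eye = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    Sigma = lincomb([(gamma, outer(F, F)), (1.0, eye)])
    direction = lincomb([(1.0, outer(F, Delta)), (1.0, outer(Delta, F)),
                         (2.0, outer(Delta, Delta))])
    lhs = -inner(Sigma, direction)
    # Delta^T F is the r x r matrix with entries sum_i Delta[i][a] * F[i][b].
    DtF = [[sum(Delta[i][a] * F[i][b] for i in range(d)) for b in range(r)]
           for a in range(r)]
    rhs = (-2 * gamma * (inner(F, Delta) + inner(DtF, DtF))
           - 2 * inner(F, Delta) - 2 * inner(Delta, Delta))
    return lhs, rhs

lhs, rhs = t1_both_sides()
print(abs(lhs - rhs) < 1e-9)
```

The check relies only on the orthonormality of the factor, which is the property used in the derivation.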
By Lemma, the matrix $F^\top F_{\pi^*}$ is positive semidefinite, and has operator norm bounded as
$|||F^\top F_{\pi^*}|||_{op} \le |||F|||_{op}|||F_{\pi^*}|||_{op} \le 1$, so that the matrix $-\Delta^\top F_{\pi^*} = I - F^\top F_{\pi^*}$ is also positive semidefinite. We therefore have the bound

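$$|||\Delta^\top F_{\pi^*}|||_F^2 \le |||\Delta^\top F_{\pi^*}|||_{nuc}\,|||\Delta^\top F_{\pi^*}|||_{op} = -\mathrm{trace}(\Delta^\top F_{\pi^*})\cdot|||\Delta^\top F_{\pi^*}|||_{op}.$$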

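Combined with the bound $|||F|||_{op} \le 1$ and the orthonormality of $F_{\pi^*}$, we find that

$$-\langle\langle F_{\pi^*}, \Delta\rangle\rangle = -\langle\langle F_{\pi^*}, F\rangle\rangle + |||F_{\pi^*}|||_F^2 \ge -\langle\langle F_{\pi^*}, F\rangle\rangle + \tfrac{1}{2}|||F_{\pi^*}|||_F^2 + \tfrac{1}{2}|||F|||_F^2 = \tfrac{1}{2}|||\Delta|||_F^2.$$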

It follows that 


ri>7|||A|||2(l-|||A^F^*|||op) + 


- 2IIIAII 


^ > 
F — 


(7^^ - l), 


where in the last ineqnality we nse ||| A'^'F^-* |||op < |||A|||op < 1 — r^. Combined with the assnmption 
7 > it thns follows that Ti > ii^|||A|||p. 

In order to bound $T_2$, we require control on how the matrix $W$ behaves when acting on matrices
in the set $\mathcal{C}(k) := \{U \in \mathbb{R}^{d\times r} \mid \|U\|_{2,1} \le \sqrt{k}\,|||U|||_F\}$.


Lemma 6. There is a universal constant c > 0 such that 

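$$|\langle\langle W, U\otimes V\rangle\rangle| \le c\,(\gamma+1)\max\Big\{\sqrt{\tfrac{k\log d}{n}},\ \tfrac{k\log d}{n}\Big\}\,|||U|||_F\,|||V|||_F \quad \text{for all } U, V \in \mathcal{C}(k) \qquad (62)$$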

with probability at least 1 — 2d~^. 

See Appendix D for the proof of this lemma; it is based on variants of techniques from Lemma 12
of Loh and Wainwright [46].

Now observe that the row sparsity of F* implies F* G C(fc), VF* G £{M*). Recall that R is 
the row snpport set of F*, with denoting its complement. Since F G F, we are gnaranteed 
that ||F|| 2 ,i < ||F*|| 27 , which implies the cone ineqnality ||A 7 ^c ||27 < ||A/j|| 2 ,i. By assnmption 
|F| < k, whence ||A ||27 < \/fc|||A|||p. It follows that A G C{k). Applying Lemma we find that 
with probability at least 1 — 


iTsI = |2((W, F^* ® A + 2A® A))| < 2|(F^*)^W^A|+2|A^WA| 

<4cma.{yL^,A5-h(, + l)C?ll|A|||.. 

t V n n ) 

Combining the bounds for $T_1$ and $T_2$ proves that the local descent condition (49) for $\mathcal{L}_n$ is satisfied.


Local Lipschitz: Let us verify the relaxed Lipschitz condition (50). Observe that for all matrices
$F \in \mathcal{F}\cap\mathbb{B}_2(\rho; F^*)$, $F' \in \mathcal{F}$ and $F^* \in \mathcal{E}(M^*)$, we have $F = F^* + (F - F^*)$ and

$$F - F' = (F - F^*) - (F' - F^*).$$

Following the argument above, one can show that the three matrices $F^*$, $F - F^*$ and $F' - F^*$ all belong
to the set C{k). Conseqnently, Lemmagnarantees that with probability at least 1 — 8 d~ 

\((W, (F - F') ® F))| < 2cmax{yL^, ^^((7 + 1) • {|||F-|||p + |||F - F*|||p) 


7-4 


||F-FN||p + |||F'-F* 


< 12c max 
where in the last inequality we use $|||F' - F^*|||_F \le \sqrt{r}\,(|||F'|||_{op} + |||F^*|||_{op}) \le 2\sqrt{r}$. It follows that

\{{VC4F), (F-F')®F))|<|((S, {F-F')®F))\ + \{{W, {F - F') ® F))\ 

<|||SL,V^|||F-F'|||p|||FL, + ^ae„ 

< 2(7 + l)Vr\lF - F'IIf + 

< 2(7 + 1)V^(1 + |||F - F'IIf), 


where the last inequality follows from $\epsilon_n \le 1$ and $\alpha \le 2(\gamma+1)\sqrt{r}$. Thus, we have established the
relaxed Lipschitz condition (50) for $\mathcal{L}_n$.


Local smoothness: Since $\nabla\mathcal{L}_n(F\otimes F) - \nabla\mathcal{L}_n(F'\otimes F') = 0$, the first smoothness condition (52a)
for $\mathcal{L}_n$ is satisfied trivially. On the other hand, we have



|((V£„(M), [F - F*) ® {F' - F")))\ < K(S, (F - F*) ® {F' - F")))| + \{{W, {F - F*) ® {F' - F")))\ 

< (7 + 1)V^\IF - F*|||f|||T' - F"|||f + K(1T, {F - F*) ® (F' - F")))|. 


Following the same argument above, we can show that F — F*,F' — F*,F” — F* G C{k), whence 
Lemma guarantees that 


|((VF, (F-F*)0(F'-F")))| <c(7 + l)max{y 


klogd klogd' 


n n 


I • |||F - F*|||f(|||F' - F*|||f + |||F" - F*|||f) 


(0 


• F - F* 


— g Ct^ni 

with probability at least 1 — 8 d~^. Here step (i) follows from the inequality max{|||F' — F*|||f, |||F" — 
T*|||f} < y/r, valid for any pair F',F” G M 2 {p]F*). We conclude that 

\{{VCn{M), {F - F*) ® (F' - F")))| < (7 + " ^*IIIf|||T' - F"|||f + J«en|||F - F*|||f, 

o 

thereby establishing the second smoothness condition (52b) for $\mathcal{L}_n$.


6.4 Proof of Corollary 

We now prove our claims for the planted densest subgraph model. Since $\mathcal{E}(M^*)$ is a two-element
set, it follows that for any vector $F \in \mathcal{F}\cap\mathbb{B}_2(\rho; F^*)$, the projection

$$F_{\pi^*} := \arg\min_{\widetilde{F}^* \in \mathcal{E}(M^*)} |||\widetilde{F}^* - F|||_F$$

is always equal to the cluster membership vector. The $M^*$-invariance of the set $\mathcal{F}$ thus follows.
Under the planted densest subgraph model, the expectation of the shifted adjacency matrix $S$ has
the expression

$$\bar{S} := \mathbb{E}[S] = \frac{p-q}{2}\{2F^*\otimes F^* - 1\otimes 1\},$$

where $1 \in \mathbb{R}^d$ denotes a vector of all ones. The noise matrix $W := S - \bar{S}$ has i.i.d. zero mean
entries with variance bounded by $p$. The gradient of the map $F \mapsto \mathcal{L}_n(F\otimes F)$ is given by $-2SF$. Below
we verify the local descent, Lipschitz and smoothness conditions.
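Local descent: We have the decomposition

$$\langle\langle\nabla\mathcal{L}_n(M), M - M^* + \Delta\otimes\Delta\rangle\rangle = \underbrace{-2\langle\langle\bar{S}, F^*\otimes\Delta + \Delta\otimes\Delta\rangle\rangle}_{T_1}\;\underbrace{-\,2\langle\langle W, F^*\otimes\Delta + \Delta\otimes\Delta\rangle\rangle}_{T_2}.$$

We proceed by lower bounding $T_1$ and upper bounding $|T_2|$.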


Beginning with the term $T_1$, for any feasible deviation $\Delta$, the two matrices $-\bar{S}$ and $F^*\otimes\Delta$ have
the same sign on each entry, whence

$$-2\langle\langle\bar{S}, F^*\otimes\Delta\rangle\rangle = (p-q)\|F^*\otimes\Delta\|_1 \ge (p-q)k\|\Delta\|_1.$$
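
The sign argument can be illustrated on a small instance. The sketch below is a minimal check under stated assumptions: the membership vector has $k$ ones followed by zeros, the feasible point has entries in $[0,1]$ (an assumption about the constraint set, made only for this illustration), and the function name is hypothetical. In this setting the three quantities in the display coincide.

```python
import random

def sign_identity_terms(d=8, k=3, p=0.7, q=0.2, seed=1):
    rng = random.Random(seed)
    f_star = [1.0] * k + [0.0] * (d - k)      # cluster membership vector
    # Hypothetical feasible point with entries in [0, 1] (assumed constraint).
    f = [rng.random() for _ in range(d)]
    delta = [fi - si for fi, si in zip(f, f_star)]
    half = (p - q) / 2.0
    # S_bar = ((p - q) / 2) * (2 F* F*^T - 1 1^T)
    s_bar = [[half * (2.0 * f_star[i] * f_star[j] - 1.0) for j in range(d)]
             for i in range(d)]
    # -2 <<S_bar, F* (x) Delta>> with F* (x) Delta = F* Delta^T
    lhs = -2.0 * sum(s_bar[i][j] * f_star[i] * delta[j]
                     for i in range(d) for j in range(d))
    l1_outer = sum(abs(f_star[i] * delta[j]) for i in range(d) for j in range(d))
    l1_delta = sum(abs(x) for x in delta)
    return lhs, (p - q) * l1_outer, (p - q) * k * l1_delta

a, b, c = sign_identity_terms()
print(abs(a - b) < 1e-9, abs(b - c) < 1e-9)
```

On the cluster the deviation is non-positive and $\bar{S}$ is positive, while off the cluster both signs flip, which is why the inner product reduces to an $\ell_1$ norm.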


On the other hand, the bounds |||A|||f < ||||F*|||op = and ||F||i < k = ||F*||i imply that 
||A||i < 2\/fc|||A|||F < |fc, from which it follows that 

|2((F, A(8)A))| < 2||5||oo||A(8)A||i = {p - g)||A||f < ^{p - g)A:||A||i. 

5 

Putting together the piecesr, we obtain the lower bound Ti > g(p — g)A:||A||i. 

Now turning to term T 2 , by Bernstein’s inequality and Proposition[^ there is a universal constant 
Co > 0 such that 

\\W F* Woo < Co s/pkio^ and |||hP|||op < CQs/pd + pk log d, 
with probability at least 1 — d~^. On this event, the term T 2 can be bounded as 


|r2|<2(||WF*||oo||A||i + |||WLp|||A|||2) 

< 2co[\/pk logd||A||i + yjpd + pk log d||| A|||2) 
{ii) I 


where the step (ii) follows from the clustering condition (27), as well as the upper bound |||A|| 


F < 


|A||i < 


|i, using the fact that || A||oo < 1- Combining the bounds for Ti and T 2 , we conclude 


that 


((V£„(M), M-M* + A®A)) > ^k{p - g)||A||i > ^k{p - q)|||A|||2, 


thereby establishing the local descent condition (49) for $\mathcal{L}_n$.


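Local Lipschitz and smoothness: Since $\nabla\mathcal{L}_n(M) - \nabla\mathcal{L}_n(M') = 0$, the first smoothness condition (52a) is satisfied trivially. It remains to verify the second smoothness condition (52b) and the
relaxed Lipschitz condition (50). For each $F, F', F'' \in \mathcal{F}$ and $G \in \mathbb{R}^{d\times r}$, observe that

$$|\langle\langle\nabla\mathcal{L}_n(M), (F-G)\otimes(F'-F'')\rangle\rangle| \le \underbrace{|\langle\langle\bar{S}, (F-G)\otimes(F'-F'')\rangle\rangle|}_{T_1} + \underbrace{|\langle\langle W, (F-G)\otimes(F'-F'')\rangle\rangle|}_{T_2}.$$
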
Note that the matrices $F', F'' \in \mathcal{F}$ satisfy the constraint $1^\top F' = 1^\top F''$, which implies that
$(1\otimes 1)(F' - F'') = 0$. It follows that $T_1$ can be upper bounded as


Ti = 


p-q 


{{2F*®F* - 1®1, {F-G)® {F' - F"))) 


= {p-q) {{F*®F*, {F-G)® {F' - F”))) 
<(p-(?)|||F*|||2|||F-G|||p|||F'-F"|||p 
= 2{p-q)kl\F-GU\F'-F''\y. 


Similarly, the second term can be upper bounded as 


T2 < 
(b 


IIF- 


\\F' -F" 


<coVpd + pklogd\lF - G|||p|||F' - F" 

(ii) T 

<-fe(p-g)|||F-G|||p|||F'-F"|||p, 


where inequality (i) holds with probability at least 1 — d ^ as proved above, and inequality (ii) 


holds under the clustering condition (27). Combining the bounds for Ti and T 2 with the choice 
j3 = 12{p — q)k, we conclude that 


\{{VCn{M), (F-G)®(F'-F" 


< ^|||F- 


IIF' -F" 


VF, F', F" eF,Ge 


\)dxr 


For an arbitrary $F^* \in \mathcal{E}(M^*)$, setting $G = F^*$ in this inequality establishes the smoothness condition (52b). On the other hand, setting $G = 0$ and noting that $|||F|||_F^2 \le \|F\|_1\|F\|_\infty \le k = |||F^*|||_F^2$,
we obtain the relaxed Lipschitz condition (50) for $\mathcal{L}_n$.


6.5 Proof of Corollary 

We now prove our claims for the one-bit matrix completion model. By assumption, the initial matrix
$F^0$ belongs to the set $\mathbb{B}_2(\rho; F^*)\cap\mathcal{F}$, where the set $\mathcal{F}$ was previously involved in our analysis of
ordinary matrix completion (see Section 6.2). Therefore, following the argument in Remark, we

can show that ||F|| 2 ,oo < 11-^ “ lb ,00 < C* G £{M*) and F G B 2 (e; F*) n F, 

and that F is M*-faithful. Consequently, we have for all relevant matrices M* and M, we are 
guaranteed that 


max 


{||M7a||oo, ||M/a||oo} <4 


pr 

da 


Now define a (random) function $H : \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ with entries

$$[H(X)]_{ij} = \frac{f'(X_{ij})\,(-Y_{ij} + 2f(X_{ij}) - 1)}{f(X_{ij})\,(1 - f(X_{ij}))}.$$

With this notation, we have


V£„(M) = ^ no [H{M/a)] . 

(7 


(63) 
For future reference, we claim that each component $[H(\cdot)]_{ij}$ is bounded and $4L_{4\nu}$-Lipschitz over
$[-4\nu, 4\nu]$. This property follows because $f$ satisfies the bounds in equation (29) and $|Y_{ij}| \le 1$, so
that $|[H(x)]_{ij}| \le 2L_{4\nu}$ surely. Moreover, the derivative can be bounded as

$$|[H'(x)]_{ij}| = \left|\frac{f''(x)(Y_{ij} - 2f(x) + 1) - 2f'(x)^2}{f(x)(1 - f(x))} - \frac{f'(x)^2\,(Y_{ij} - 2f(x) + 1)(1 - 2f(x))}{f^2(x)(1 - f(x))^2}\right| \le 4L_{4\nu},$$

which certifies the Lipschitz property.
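
The two-fraction form of the derivative can be sanity-checked against a finite difference. The sketch below assumes the logistic link $f(x) = 1/(1+e^{-x})$, which is only an illustrative choice (the argument in the text covers general $f$ satisfying (29)); all function names are hypothetical. For this link $H(x) = 2f(x) - 1 - Y$, so the derivative has magnitude $2f'(x)$.

```python
import math

def f(x):
    """Logistic link; an illustrative choice, the text allows general f."""
    return 1.0 / (1.0 + math.exp(-x))

def f1(x):
    """First derivative f'(x) of the logistic link."""
    s = f(x)
    return s * (1.0 - s)

def f2(x):
    """Second derivative f''(x) of the logistic link."""
    s = f(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

def H(x, y):
    """Entry of H: f'(x) (-y + 2 f(x) - 1) / (f(x) (1 - f(x)))."""
    return f1(x) * (-y + 2.0 * f(x) - 1.0) / (f(x) * (1.0 - f(x)))

def H_prime_abs(x, y):
    """|H'(x)| computed from the two-fraction expression for the derivative."""
    fx = f(x)
    first = (f2(x) * (y - 2.0 * fx + 1.0) - 2.0 * f1(x) ** 2) / (fx * (1.0 - fx))
    second = (f1(x) ** 2 * (y - 2.0 * fx + 1.0) * (1.0 - 2.0 * fx)
              / (fx ** 2 * (1.0 - fx) ** 2))
    return abs(first - second)

def finite_difference_abs(x, y, h=1e-5):
    """Central finite-difference estimate of |H'(x)|."""
    return abs((H(x + h, y) - H(x - h, y)) / (2.0 * h))

print(abs(H_prime_abs(1.3, -1.0) - finite_difference_abs(1.3, -1.0)) < 1e-6)
```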


With this set-up, we are now prepared to establish the local descent, Lipschitz and smoothness
conditions for $\mathcal{L}_n$.

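Local descent: Let us introduce the shorthand $G(M) := \mathbb{E}[\nabla\mathcal{L}_n(M)] = \nabla\mathbb{E}[\mathcal{L}_n(M)]$. We begin by splitting the gradient into two terms, corresponding to the expectation and the zero-mean
deviation, thereby obtaining

$$\langle\langle\nabla\mathcal{L}_n(M), M - M^* + \Delta\otimes\Delta\rangle\rangle = \underbrace{\langle\langle G(M), M - M^* + \Delta\otimes\Delta\rangle\rangle}_{T_1} + \underbrace{\langle\langle\nabla\mathcal{L}_n(M) - G(M), M - M^* + \Delta\otimes\Delta\rangle\rangle}_{T_2}.$$
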

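Controlling $T_1$: For the expectation term $T_1$, we first note that $\mathbb{E}[\mathcal{L}_n(M)]$ is convex in $M$ and has
the form of expected negative log likelihood, whence

$$\langle\langle G(M), M - M^*\rangle\rangle \ge \mathbb{E}[\mathcal{L}_n(M)] - \mathbb{E}[\mathcal{L}_n(M^*)] = p\,D(f(M^*/\sigma)\,\|\,f(M/\sigma)) \ge p\,d_H^2(f(M^*/\sigma), f(M/\sigma)),$$

where $D(\cdot\|\cdot)$ and $d_H^2(\cdot,\cdot)$ denote the KL and Hellinger distances, respectively. To proceed, we use a
known lower bound (Lemma 2 in the paper [24]) on the Hellinger distance: for matrices $X, X' \in \mathbb{R}^{d\times d}$
such that $\|X\|_\infty, \|X'\|_\infty \le \alpha$, we have
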
4(/(x),/(x'))>^^^. 


(64) 


Since ||M/(t||oo, ||Tf*/(T||oo < 4z^, applying the lower bound (64) with a = 4^ yields the lower bound 
((G(M), M-M*)) > ^ -|||M — M*|||p. Furthermore, since F* is orthonormal, |||A|||op < and 
Ft,-* is symmetric, we have 


((G(M), M - M*)) > 




Ay 


On the other hand, using the expression (63) for the gradient and the assumptions in equation (29), 
we find that 


ll|G(M)|||p 


p 

a 


f'{M/(j) o {f{M/cr) - f{M*/a)) ^ py/L^ 

/(M/fj) o (1 -/(M/cr)) cr 


|||/(M/a)-/(M7a)|||p. 
Inequality (29) combined with the bound $\sup_{z\in\mathbb{R}}|f(z)(1 - f(z))| \le 1$ ensures that $f$ is $\sqrt{L_{4\nu}}$-Lipschitz
over the interval $[-4\nu, 4\nu]$. It follows that


|||G(M)|||p < ^|||M - M*\y < ^IIIAIIIp, 


(T^ 




(65) 


where the last step uses the upper bound |||A|||op < 1. Combining these bounds, we obtain the 
following lower bound on the expectation: 


Ti = ((G(M), M-M* + AOA)) > ((G(M), M - M*)) - |||G(M)|||f|||A 

P III A III 2 III A . A III 2 


> 

> 


P 


II A, III 2 ^pLii, 

11^ III F 




op ■ III A Ip 


32a^i4^ 

where the last step uses the bound |||A|||op < ■ 


|A|| 


Controlling $T_2$: We now turn to analysis of the deviation term $T_2$. Using the symmetry of the
matrix $\nabla\mathcal{L}_n(M) - G(M)$, we may rewrite it as

$$T_2 = 2\langle\langle\nabla\mathcal{L}_n(M) - G(M), \Delta\otimes F\rangle\rangle.$$
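
The rewriting of $T_2$ uses only the symmetry of the matrix in the first slot. As a minimal sketch, assuming $A\otimes B = AB^\top$, $F = F_{\pi^*} + \Delta$ and $M - M^* = F\otimes F - F_{\pi^*}\otimes F_{\pi^*}$, the following code compares the two expressions for a random symmetric matrix standing in for $\nabla\mathcal{L}_n(M) - G(M)$ (names are hypothetical).

```python
import random

def outer(A, B):
    """A (x) B := A B^T for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

def inner(A, B):
    """Trace inner product <<A, B>>."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def symmetric_rewrite(d=4, r=2, seed=2):
    rng = random.Random(seed)

    def rand(m, n):
        return [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(m)]

    F_pi, Delta, R = rand(d, r), rand(d, r), rand(d, d)
    F = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(F_pi, Delta)]
    # Symmetric stand-in for the zero-mean gradient deviation.
    A = [[0.5 * (R[i][j] + R[j][i]) for j in range(d)] for i in range(d)]
    FF, PP, DD = outer(F, F), outer(F_pi, F_pi), outer(Delta, Delta)
    # M - M* + Delta (x) Delta
    direction = [[FF[i][j] - PP[i][j] + DD[i][j] for j in range(d)]
                 for i in range(d)]
    return inner(A, direction), 2.0 * inner(A, outer(Delta, F))

original, rewritten = symmetric_rewrite()
print(abs(original - rewritten) < 1e-9)
```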


We control this quantity via two auxiliary lemmas. For each B G (0,1), define the annular set 


B 


~ I 17 F III^IIIf — 11^112,00 — 


\2{VCniF<S)F) - G{F0F), A ® F)| 1 

- > -as. 

4 


and the event 

}■ 


S{B) : = I sup 
Aer(B) 

Our first lemma controls the probability of this “bad event”: 

Lemma 7. For any B G (0,1), we have P[<5(i?)] < 2d~^‘^. 

Our next lemma gives controls over the small ball Fq around the origin. In particular, we define 
the set 


Fq : = {A I I A|f < 2 , IIAII 2 < 4z^(t}, and £():=< sup 

’ Aero 


|2(V£n(F®F)-G(F®F), A®F)| ^ 


IIAII 


Lemma 8. We have P[£^o] ^ d 

See Appendices E.1 and E.2 for the proofs of these two auxiliary results.
Taking the lemmas as given, the union bound then guarantees that


sup 

II|A|||f< 1,||A||2 <4Fa 


|2(V£„(F®F) - G(F®F), A ® F)| 1 

-iiAiiir.. 4 “"^ 




<P[To] +J^P[f(2-')] 
i=0 

< cf • 2 d -^^ + ^ 3^-12 
which implies that IT 2 I < |aen||| A|||p with probability at least 1 — 3d Conditioned on this event,
we can combine the bounds for Ti and T 2 to conclude that 

((V£„(M), M - M* + A®A)) > Ti - jTsj > 2a|||A|||2 - ^ae„|||A|||p, 


valid for all matrices A such that |||A|||p < p = max{j|^, }; thereby establishing the local 

descent condition (49) for 


Local Lipschitz condition (51) and smoothness condition (52b): We begin by making note 
of the bounds 


IF' - F" 


|2,oo S 


<4W^, and |||F'- F"|||p < 2, 
d 


valid for all matrices F',F" G F nB2(/9;F*). Moreover, we have a < L = /3 and < 1. Using 
these facts, it follows that the local Lipschitz and smoothness conditions (51) and (52b) for Cn are 
implied by a bound of the form


||V/:n(M)F|||p < 

o 


||F-F*|||p + aen|||F|||p), 


( 66 ) 


valid for all matrices F G J-fW> 2 {p] F*), and F* G S{M*), and matrices H such that ||F|| 2 ,oo < dy 
Accordingly, the remainder of our effort is devoted to establishing the bound ( |66[ ) . We hrst condition 
on the event in Lemma which occurs with probability at least 1 — 2d~^. We then make note of 
the decomposition 

|||V£4M)F|||p = -|||no[h(M/c7)]F|||p < -|||no[h(M/a) -h(M7c7)]F|||p + -|||nf2[h(M7a)]F|||p, 

a a a 


Ti 


T 2 


and we bound each of these two terms separately. 

Bounding term Ti: We have 

Ti = ^ sup I {{Un [H{M/a) - H{M*/a)], G 0 H))\ 
a GeK‘^>'^:|G|||p=i 

<-|||no(F(M/7-F(M77)|||p- sup |||nn(G®F)|||p. 

cr GelR‘*>'’-:|G|||p=i 

Recall that each component $[H(\cdot)]_{ij}$ is almost surely $4L_{4\nu}$-Lipschitz over the interval $[-4\nu, 4\nu]$,
and that $\|M/\sigma\|_\infty, \|M^*/\sigma\|_\infty \le 4\nu$. Combining these facts yields the bound

|||nn(F(M/7-F(M7a))|||p < ^|||nn(M - M*) |||p 

< ^(2|||nn(F* ® (F - F*))|||p + |||nn((F - F*)®(F - F*))|||p) 

^ 72Li^^y2ppr III ^ III 
where the last inequality follows from the inequality (57c) in Lemma 5 combined with the fact that
max I||F*|| 2 ,ooj \\F — -P"*|| 2 ,oo} < 4 a/A pplying inequality (57c) a second time yields 


sup IIIq/G (g) -f7)|||F < Qyjlpixr sup IIG'IIIf = Q^/2p^r. 

GeK‘ix’':|G|||F=l GeR‘^x’';|||G|||F=l 

Putting together the pieces yields the upper bound 

^ 48£4.PA.r . g 

8 


(67a) 


Bounding term T 2 : Turning to the term T 2 , we observe that conditioned on Y, the matrix H{M*/a) 
is a deterministic quantity with entries bounded uniformly as \[H(M*/a)]oj\ < ^ 4 ,^ for all indices 


i,j G [d]. By applying Lemma 11 and integrating out the conditioning, we hnd that the second 
term is bounded as 


T2 < -\lUn[hiM*/a)]\U • MHh < 

a a 

where these bounds hold with probability at least 1 — d~^. 


||-f7|||F < ■^ae„|||iL|||F, (67b) 

o 


Finally, combining our bounds for $T_1$ and $T_2$ in equations (67a) and (67b) yields the desired
bound (66).


Local smoothness condition (52a): We begin by conditioning on the event in Lemmawhich 
holds with probability at least 1 — 2d~'^. For each pair of matrices F', F" in the set F n B2(/9; F*) 
and any matrix F* G we then have 

|((V£„(M) - VCn{M'), F' ® (F - F*)))| = ^\{Un{H{M/a) - h{M'/a)), F' ® (F - F*))| 

< hun{H{M/a) - h{M'/a))UUn{F' ® (F - F*))|||f 


< 


a 
4L/4j/ 


- 


|||no(M-M')|||F|||nn(F'® (F-F* 


l|F) 


where the last inequality follows from the fact that the function $H$ is element-wise $4L_{4\nu}$-Lipschitz.

Recall equation (61) from Section 6.2, which ensures that

|||nn(M-M')|||F < 14y^|||F-F'|||F. 


Combined with inequality (57c) from Lemma 5, we find that $|||\Pi_\Omega(F'\otimes(F - F^*))|||_F \le 6\sqrt{2p\mu r}\,|||F - F^*|||_F$, from which it follows that

K(V£„(M)-V/:„(M'), F'®(F-F*)))| < ^^^^•p/xr|||F-F'|||F|||F-F*|||F 


(7^ 


< 1/3|||F-F'|||f|||F-F*|||f, 


thereby establishing the local smoothness condition (52a). 
6.6 Proof of Corollary

We now prove our claims for the matrix decomposition problem. By dividing through by $|||F^*|||_{op}$, we
may assume without loss of generality that $|||F^*|||_{op} = 1$. The set $\mathcal{F}$ and the values of $\rho$ and $d(F^0, F^*)$
are the same as used in the proof of matrix completion in Section 6.2, so we make use of the results
therein. In particular, we showed there that the set $\mathcal{F}$ is $M^*$-faithful.

Given the observation matrix $Y = M^* + S^* + E$, the gradient takes the form

$$\nabla\mathcal{L}_n(M) = (M - M^*) + (S(M) - S^*) - E,$$

where $S(M) := \Pi_{\mathcal{S}}(Y - M)$. Below we verify the local descent, Lipschitz and smoothness conditions.


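Local descent: Expanding $\nabla\mathcal{L}_n(M)$, the quantity $\langle\langle\nabla\mathcal{L}_n(M), M - M^* + \Delta\otimes\Delta\rangle\rangle$ can be decomposed into the sum

$$\underbrace{|||M - M^*|||_F^2}_{T_1} + \underbrace{\langle\langle M - M^*, \Delta\otimes\Delta\rangle\rangle}_{T_2} + \underbrace{\langle\langle S(M) - S^*, M - M^* + \Delta\otimes\Delta\rangle\rangle}_{T_3} + \underbrace{\langle\langle -E, M - M^* + \Delta\otimes\Delta\rangle\rangle}_{T_4}.$$
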

Note that M — M* = F-,^* (g) A -|- A (g) En-* + A®A. By Lemma the matrix AJF.,^* is symmetric, 
so expanding the Frobenius norm shows that Ti > 2|||A||||. Since |||A|||op < |, we have 

|T2|<2|||ALp|||A|||2 + |||A|||2J||A|||2<||||A|||2. 

With As : = S{M) — S* and ej being the j-th standard basis, we find that 

iTsI = |2((A5, A®F))\ <2|||fTA5|||f|||A|||p 


= 2 . 


^||FTA5e,-||2|||A|||F <2, 






2,00 


ll^se. 


j 111 


lAII 


F) 


where we use the symmetry of As in the first equality. Inequality ( |55a[ ) ensures that ||E|| 2 ,oo < 

Moreover, for each S' G 5, we have the inequalities HAgejUi < 2 ^/k\\Asej\\2, j G [d] thanks to the 
row-wise ii constraints and the fc-sparsity of the columns of S*. It follows that 


inl <4,/^, 


El|Asej|lil||A|| 

\ i=i 


= 4 


prk 

~T 


II|A 5 |||f|||A|| 


Under the assumption < ci of the corollary, we obtain IT 3 I < ^|||A 5 |||f|||A|||f. But S* G S and 
the projection II^ is non-expansive, whence 

IIIA^IIIf = |||n5(T - M) - S*|||f < |||(T - M) - S*|||f = HIM - M*\y < 3 |||A|||f, (68 ) 

and so we have shown that IT 3 I < JgllAlp. Finally, we have 

\T4\ < WEMM -M* + A0A|||f < y||| f;|||f|||a|||f. 

Putting together the bounds for $T_1$, $T_2$, $T_3$ and $T_4$, we conclude that

II Fill A If 


1 16 

{{VCniM), M-M* + A®A)) > -|||A |||2 - — 

g) O 


F, 


thereby proving the local descent condition (49) for $\mathcal{L}_n$.


Local Lipschitz condition: 

have 


Using the inequality (68) above and the assumption |||i?|||F < g, we 


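$$\begin{aligned} |||\nabla\mathcal{L}_n(M)F|||_F &= |||(M - M^* + S(M) - S^* - E)F|||_F \\ &\le (|||M - M^*|||_F + |||\Delta_S|||_F + |||E|||_F)\,|||F|||_{op} \\ &\le (6|||\Delta|||_F + 1)\,|||F|||_{op} \\ &\le 8, \end{aligned}$$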


where the last inequality follows from 
condition in (51). 


op< 


< 2 
F - 5- 


Therefore, Cn satisfies the local Lipschitz 


Local smoothness: Observe that 


\{{VCn{M) - VCniM'), F'®{F- F*)))\ = \{{M - M'+ S{M) - S{M'), F'0 {F - F*)))\ 

< (|||M - M'If + |||5(M) - 5(M')|||f)|||F - F*|||f|||F' 

The non-expansiveness of the projection II^ ensures that 

|||5(M) - 5(M')|||p = \lUs{Y -M)- Us{Y - M')|||p < |||M - M'|||p < ^|||F - F'|||p,. 

5 

where we use F,F'€M 2 {^;F*). It follows that 

\{{YCn{M) - S/CniM'), F'0iF- F*)))\ < 12|||F - F'|||p|||F - F*|||p, 


proving the first smoothness condition (52a). 

Similarly, combining inequality ( 68 ) with the bound |||L'|||p < | implies that 


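$$\begin{aligned} |\langle\langle\nabla\mathcal{L}_n(M), (F - F^*)\otimes(F' - F'')\rangle\rangle| &= |\langle\langle M - M^* + S(M) - S^* - E, (F - F^*)\otimes(F' - F'')\rangle\rangle| \\ &\le (|||M - M^*|||_F + |||S(M) - S^*|||_F + |||E|||_F)\,|||F - F^*|||_F\,|||F' - F''|||_F \\ &\le 7\,|||F - F^*|||_F\,|||F' - F''|||_F, \end{aligned}$$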


thereby verifying the second smoothness condition (52b). 


7 Discussion 

In this paper, we have laid out a general framework for analyzing the behavior of projected gradient 
descent for solving low-rank optimization problems in the factorized space. We have illustrated the 
consequences of our general theory for a number of concrete models, including matrix regression, 
structured PCA, matrix completion, matrix decomposition and graph clustering. 
A Proof of Theorem


Recall that in Section 


5.2 


we proved Theorem nnder the assnmption that Cn satisfies the relaxed 
local Lipschitz condition (40) as well as the local descent condition (14) and smoothness condi¬ 
tions ( |l 6 | ) . We establish Theorem by showing that these conditions for Cn are implied by the 
corresponding conditions for Cn in Definitions [5}|^ with the same parameters a, /3, L, and p with 
p < ar{F*). 


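Let $F$ be an arbitrary matrix in $\mathcal{F}\cap\mathbb{B}_2(\rho; F^*)$ and $F^*$ an arbitrary member in $\mathcal{E}(M^*)$. Since
$\rho < \sigma_r(F^*)$, Lemma guarantees that $F_{\pi^*} = \arg\min_{\widetilde{F}^*\in\mathcal{E}(M^*)}|||\widetilde{F}^* - F|||_F$ is uniquely defined. We use
the shorthand $\widetilde{G} := \nabla\widetilde{\mathcal{L}}_n$, $G := \nabla\mathcal{L}_n$, $\Delta_{\pi^*} := F - F_{\pi^*}$ and $\Delta = F - F^*$.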

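Local descent condition: Observe that

$$\begin{aligned} \langle\langle\widetilde{G}(F), F - F^*\rangle\rangle &= \langle\langle G(F\otimes F), \Delta\otimes F + F\otimes\Delta\rangle\rangle \\ &= \langle\langle G(F\otimes F), F\otimes F - F^*\otimes F^*\rangle\rangle + \langle\langle G(F\otimes F), \Delta\otimes\Delta\rangle\rangle \\ &= \langle\langle G(F\otimes F), F\otimes F - F_{\pi^*}\otimes F_{\pi^*}\rangle\rangle + \langle\langle G(F\otimes F), \Delta_{\pi^*}\otimes\Delta_{\pi^*}\rangle\rangle \\ &\quad + \langle\langle G(F\otimes F), (F_{\pi^*} - F^*)\otimes\Delta\rangle\rangle + \langle\langle G(F\otimes F), \Delta_{\pi^*}\otimes(F_{\pi^*} - F^*)\rangle\rangle, \end{aligned}$$

where the last step follows from $F^*\otimes F^* = F_{\pi^*}\otimes F_{\pi^*}$ and $\Delta = \Delta_{\pi^*} + F_{\pi^*} - F^*$. We then apply
the local descent condition (49) for $\mathcal{L}_n$ to the first two terms above, and the local smoothness
condition (52b) for $\mathcal{L}_n$ to the last two terms. Doing so yields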


((G(F), F - F*)) >2a|||A^*|||2 - ^|||F^* - F* 


IAIIIf + IIA7r*|||F) ~ ■X^nilA. 


« III A II 

- 2 ^^ 111^11 


>2a|||A^*|||2-^|||F^* -FViipi 


A. 


f3 


— — ilFn-* — F*|||^ — aSnlA. 


n ^77* F 


— —e IIIF * — F* 

2 III TT* ^ 


>2a|||A^*|||2 - (^a|||A^.|||2 + ^|||F^* - F*\ll^ 


- ?||| F ^* - F *|||2 


- ( -Q;|||Aff*|||p -I- 2 “^n) “ ( 2 “^^ + ^lll-^vr* - F*|||p 
/32 


>a|||A^*|||^-^|||F^* -F*|||p-ae^, 


a 


where the last three steps follow from $|||\Delta|||_F \le |||\Delta_{\pi^*}|||_F + |||F_{\pi^*} - F^*|||_F$, the AM-GM inequality and
the upper bound $\alpha \le \beta$, respectively. This proves the local descent condition for $\widetilde{\mathcal{L}}_n$ in (14).


Relaxed local Lipschitz condition: Let $F'$ be an arbitrary matrix in $\mathcal{F}$. With $G(F\otimes F)$ symmetric and $\mathcal{L}_n$ satisfying the relaxed Lipschitz condition in (50), we have

$$|\langle\langle\widetilde{G}(F), F - F'\rangle\rangle| = |\langle\langle 2G(F\otimes F), (F - F')\otimes F\rangle\rangle| \le L\big(|||F^*|||_F^2 + |||F^*|||_F\,|||F - F'|||_F\big),$$

which proves the relaxed local Lipschitz condition for $\widetilde{\mathcal{L}}_n$ in (40).












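Local smoothness condition: Let $F'$ be an arbitrary matrix in $\mathcal{F}\cap\mathbb{B}_2(\rho; F^*)$. The smoothness
conditions (52a) and (52b) yield that

$$\begin{aligned} |\langle\langle\widetilde{G}(F) - \widetilde{G}(F'), F - F^*\rangle\rangle| &= |2\langle\langle G(F\otimes F)F - G(F'\otimes F')F', F - F^*\rangle\rangle| \\ &= |2\langle\langle G(F\otimes F)F' - G(F'\otimes F')F', F - F^*\rangle\rangle + 2\langle\langle G(F\otimes F)(F - F'), F - F^*\rangle\rangle| \\ &\le \beta\,|||F - F'|||_F\,|||F - F^*|||_F + \alpha\epsilon_n|||F - F^*|||_F, \end{aligned}$$

which establishes the smoothness condition for $\widetilde{\mathcal{L}}_n$ in (16).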


B Technical lemmas for Corollary 

In this appendix, we prove the technical lemmas involved in the proof of Corollary on the matrix 
sensing model. 

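B.1 Proof of Lemma 3

By the bilinearity of the inner product, we may assume without loss of generality that $|||A|||_F = |||B|||_F = 1$.
Since the matrices $A \pm B$ have rank at most $4r$, the RIP with constant $\delta_{4r}$ ensures that

$$(1 - \delta_{4r})|||A \pm B|||_F^2 \le \frac{1}{n}\|\mathfrak{X}_n(A \pm B)\|_2^2 \le (1 + \delta_{4r})|||A \pm B|||_F^2.$$

It follows that

$$\begin{aligned} \frac{1}{n}\langle\langle\mathfrak{X}_n(A), \mathfrak{X}_n(B)\rangle\rangle &= \frac{1}{4n}\big(\|\mathfrak{X}_n(A + B)\|_2^2 - \|\mathfrak{X}_n(A - B)\|_2^2\big) \\ &\le \frac{1}{4}\big((1 + \delta_{4r})|||A + B|||_F^2 - (1 - \delta_{4r})|||A - B|||_F^2\big) \\ &= \langle\langle A, B\rangle\rangle + \frac{\delta_{4r}}{2}\big(|||A|||_F^2 + |||B|||_F^2\big) \\ &= \langle\langle A, B\rangle\rangle + \delta_{4r}|||A|||_F|||B|||_F. \end{aligned}$$

It follows from a similar argument that $\frac{1}{n}\langle\langle\mathfrak{X}_n(A), \mathfrak{X}_n(B)\rangle\rangle \ge \langle\langle A, B\rangle\rangle - \delta_{4r}|||A|||_F|||B|||_F$.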
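
The proof rests on the polarization identity for a linear map and on one line of algebra. The following minimal sketch checks both, assuming a generic linear map given by trace inner products with random Gaussian matrices (a hypothetical stand-in for the sensing operator; no RIP property is verified here).

```python
import random

def sensing(Xs, A):
    """Linear map A -> (<<X_1, A>>, ..., <<X_n, A>>)."""
    return [sum(x * a for rx, ra in zip(X, A) for x, a in zip(rx, ra))
            for X in Xs]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polarization_gap(d=3, n=6, seed=4):
    """|<X(A), X(B)> - (1/4)(||X(A+B)||^2 - ||X(A-B)||^2)| for random inputs."""
    rng = random.Random(seed)

    def rand():
        return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]

    Xs = [rand() for _ in range(n)]
    A, B = rand(), rand()
    plus = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    minus = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    xp, xm = sensing(Xs, plus), sensing(Xs, minus)
    lhs = dot(sensing(Xs, A), sensing(Xs, B))
    return abs(lhs - 0.25 * (dot(xp, xp) - dot(xm, xm)))

def rip_algebra_gap(ip, na2, nb2, delta):
    """Gap between (1/4)((1+d)|A+B|^2 - (1-d)|A-B|^2) and <A,B> + (d/2)(|A|^2 + |B|^2).

    ip is <<A, B>>; na2 and nb2 are the squared Frobenius norms of A and B.
    """
    plus2 = na2 + nb2 + 2.0 * ip
    minus2 = na2 + nb2 - 2.0 * ip
    return abs(0.25 * ((1 + delta) * plus2 - (1 - delta) * minus2)
               - (ip + 0.5 * delta * (na2 + nb2)))

print(polarization_gap() < 1e-9, rip_algebra_gap(0.3, 1.0, 1.0, 0.1) < 1e-12)
```

With unit Frobenius norms the second expression reduces to $\langle\langle A, B\rangle\rangle + \delta_{4r}$, matching the last line of the display.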

C Technical lemmas for Corollary 

In this appendix, we prove the technical lemmas involved in the proof of Corollary on the matrix 
completion model. 

C. l Proof of Lemma |4] 

Define a subspace $T \subseteq \mathbb{R}^{d\times d}$ of $d$-dimensional matrices as follows

$$T := \{X \mid X = (F^*\otimes U) + (V\otimes F^*) \text{ for some } U, V \in \mathbb{R}^{d\times r}\},$$

and let $\Pi_T$ be the Euclidean projection onto $T$. Since $F^*$ is $4\mu$-incoherent, a known result in
exact matrix completion 18 guarantees that as long as p > for a suficiently large universal 

constant c, then 


Knrnnnr - pnr)x\i^ < ep\ix\y, for aii x g r 

with probability at least 1 — 2d~^. Noting that the matrices F* HF* belong to the subspace 
T, we can apply the above inequality to obtain 

(l-e)p|||F*®/7±G®F*|||p < |||nrnnnr(E* ±G®F*)|||f < (l + e)p|||E*®/? ±G®F*|||p. 

The rest of the proof is similar to that of Lemma In particular, by the bilinearity of the inner 
product, we may assume |||T* ® Hfp = |||G (8> -E*|||p = 1. Using the above inequalities, we find that 

{{Un {F* ® F), Un (G ® F*))) = {{UnUr{F* ® H), UnIlr{G ® F*))) 

(|||nnnr(T* ^H + G^ F*)|||2 - |||nonr(F* ® F - G ® F*)\ll) 

(|||F* <s>H + G0 F*|||p|||nrnonr(T* <s>h + g<s> f*)|||p - |||F* <S)H-G0 F*|||p|||nrnonr(T* ® f + g ® f*)|| 

< J ((1 + e)p\lF* <s> H + G ® F*\ll - {1 - e)p|||F* ® iL - G ® F* 

=p{{F* ® F, G 0 F*)) + ep = p{{F* ® F, G ® F*)) + ep|||F* ® F|||p|||G ® F*|||p. 

This proves the first inequality in the lemma. The second inequality can be proved in the same
fashion by noting that the matrices $(F^*\otimes H) \pm (F^*\otimes G)$ also belong to the subspace $T$.


C.2 Proof of Lemma [5] 

We need the following result on random graphs |28| , which involves some universal constants ci and 
C2- 

Lemma 9. If p > then with probability at least 1 — 


Y, GiU, <(l + e)p||[/||i||U||i+C2V^||C/||2||U||2, VG,Ug 


(69) 


From the bound (69) and the assumption p > —h we find that with probability at 


least 1 — Id 


i V ^ i 

= {lFe)\lH\lt + G2^l\H\ll\\H\\Y 

4 I ,llli_rll|2 


< (1 + e)|||F|||p + e|||iL|||p, 


where the last step follows from the bound ||F^|| 2 ,oo < 
inequality (57a) in the lemma statement. 


We have thus established the first 
Let Vti := {j I {i, j) £ Q,}. When p > for C sufficiently large, the event maxj 10*1 < 2pd

holds with probability at least 1 — d~^. On this event, we have 

dr r, 

i=l k=l 

i=\ k=\ 
d 

i=i 
d 

< p-^ E l|no(^^)e*||i • (max |0*|) ||iL|||^ < |||no(Z)|||2 • 2d\\H\\l 

i=l 


But ||Lf|| 2 ,oo < ^ assumption, so the second inequality (57b) in the lemma follows. 

To establish the third inequality (57c) in the lemma, observe that conditioned on the event 
{maxj |0*| < 2pd}, we have 


d d 

El|no(77®G)e 

i=i 


<p ^Eii^l 

II? E 

i=i 

{*1 (*0)60} 


|| 2 ( max 10*1) ||iL 

3=^ 


d 

\\l ■ 2pd-36^ 

i=l 


= 72pr\lG\ll 


D Proof of Lemma 6

By rescaling, it suffices to consider matrices $U$ and $V$ with $|||U|||_F = |||V|||_F = 1$ and $U, V \in \mathbb{B}_{2,1}(\sqrt{k})$.
We need the following geometric result, which is a simple generalization of Lemma 11 in the paper [46]. For completeness, we provide the proof in Section D.1 to follow.

Lemma 10. For each integer $1 \le k \le d$, we have

$$\mathbb{B}_{2,1}(\sqrt{k})\cap\mathbb{B}_F(1) \subseteq 3\,\mathrm{cl}\{\mathrm{conv}\{\mathbb{B}_{2,0}(k)\cap\mathbb{B}_F(1)\}\}. \qquad (70)$$

Based on this lemma and continuity, it suffices to prove the bound (62) for pairs of matrices
$U, V \in \mathrm{conv}\{\mathbb{B}_{2,0}(k)\cap\mathbb{B}_F(3)\}$. Any such pair can be written as a weighted combination of the form
$U = \sum_i \alpha_i U_i$ and $V = \sum_j \beta_j V_j$, with weights $\alpha_i, \beta_j \ge 0$ such that $\sum_i \alpha_i = \sum_j \beta_j = 1$, and
constituent matrices $U_i, V_j \in \mathbb{B}_{2,0}(k)\cap\mathbb{B}_F(3)$ for each $i, j$. With this notation, observe that


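$$|\langle\langle W, U\otimes V\rangle\rangle| \le \sum_{i,j}\alpha_i\beta_j\,|\langle\langle W, U_i\otimes V_j\rangle\rangle| \le \Big(\sum_{i,j}\alpha_i\beta_j\Big)\max_{i,j}|\langle\langle W, U_i\otimes V_j\rangle\rangle| = \max_{i,j}|\langle\langle W, U_i\otimes V_j\rangle\rangle|.$$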
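
The reduction from convex combinations to their constituents can be illustrated numerically. The sketch below is a minimal example restricted to the rank-one case $r = 1$ (vectors instead of matrices, an assumption made to keep the code short), with hypothetical helper names; it compares the bilinear form at a convex combination with the maximum over constituent pairs.

```python
import random

def bilinear(W, u, v):
    """<<W, u (x) v>> = u^T W v for vectors u, v (rank-one case r = 1)."""
    return sum(u[i] * W[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def convex_weights(rng, m):
    """Random non-negative weights summing to one."""
    w = [rng.random() for _ in range(m)]
    s = sum(w)
    return [x / s for x in w]

def mix(weights, vecs):
    """Weighted combination sum_i weights[i] * vecs[i]."""
    return [sum(w * v[i] for w, v in zip(weights, vecs))
            for i in range(len(vecs[0]))]

def combination_vs_max(d=4, m=5, seed=5):
    rng = random.Random(seed)
    W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    Us = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(m)]
    Vs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(m)]
    U = mix(convex_weights(rng, m), Us)
    V = mix(convex_weights(rng, m), Vs)
    lhs = abs(bilinear(W, U, V))
    rhs = max(abs(bilinear(W, u, v)) for u in Us for v in Vs)
    return lhs, rhs

lhs, rhs = combination_vs_max()
print(lhs <= rhs + 1e-12)
```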
If we use $(U_i)_{\cdot\ell}$ and $(V_j)_{\cdot\ell}$ to denote the $\ell$-th column of $U_i$ and $V_j$, respectively, then


\m U^ ® y,))| < IW m.e ® iVj).e) 


e=i 


< 


(i) 

< 9 


sup |K| ^ \\{Ui).eh\\{Vj).e \\2 

x,y&o{k) Ikiblli/lb/ ^ 


sup 


\x'^Wy\ 


fc,2GBo(fc) ||^||2||?/||2 


i=l 


)=K 


„T 


sup \x 
*,?/sBo(fc)nB2(i) 


where step (i) follows from the Cauchy-Schwarz inequality, and |||17i|||p = |||V^ |||f < 3. It suffices to 
bound the supremum in the last RHS by 18t, where 


, f Iklogd klogd\ 


(71) 


for a universal constant c' to be specified later. 

To proceed, we make use of a standard concentration result. Recall that $X \in \mathbb{R}^{n\times d}$ is the matrix
of independent samples from a $d$-dimensional Gaussian distribution with zero mean and covariance
$\Sigma = \gamma(F^*\otimes F^*) + I_d$. By Lemma 15 in Loh and Wainwright [46], there is a universal constant $c > 0$
such that


sup \\\Xz\\\/n — z^T,z\>t < 2exp ( — cnmin I 
2eBo(2A:)nB2(l) ^ ^ 


(7+ 1)2’7 + 


-] +2k log d^ . 


Applying this inequality with z = g (x ± y) and our previously specified © of t with c' = |, we 
find that with probability 1 — we have 

^\-\\X{x + y)\\l - (x + y)’^S(x + y)| < t and ;^|-||X(x - y)\\l - {x - y)’^S(x -y)\<t 
6b n 6b n 

for all x,y € Bo(A:) n 182(1). On this event, we have 

-x'^X~^Xy = -\\X{x + y)\\l - -\\X{x-y)\\l 
n n n 

< (x + y)’^S(x + y) - (x - y)’^S(x - y) + 72t 
= 4x’^Sy + 72t 


and similarly ^x~^X~^Xy > 4xSy — 72t, whence |x''~iyy| = \ ^x'^X~^Xy — x''~Sy| < 18t. 


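D.1 Proof of Lemma 10

Let $A, B \subseteq \mathbb{R}^{d\times r}$ be closed convex sets, with support function given by $\phi_A(U) = \sup_{F\in A}\langle\langle F, U\rangle\rangle$
and $\phi_B$ similarly defined. It is well-known that $\phi_A(U) \le \phi_B(U)$ for all $U \in \mathbb{R}^{d\times r}$ if and only
if $A \subseteq B$. Accordingly, let us verify the first condition for the sets $A = \mathbb{B}_{2,1}(\sqrt{k})\cap\mathbb{B}_F(1)$ and
$B = 3\,\mathrm{cl}\{\mathrm{conv}\{\mathbb{B}_{2,0}(k)\cap\mathbb{B}_F(1)\}\}$.

For any $U \in \mathbb{R}^{d\times r}$, let $S \subseteq \{1, 2, \ldots, d\}$ be the subset that indexes the top $\lfloor k\rfloor$ rows of $U$ in $\ell_2$
norm. Then $\|U_{S^c}\|_{2,\infty} \le \|U_{j\cdot}\|_2$ for all $j \in S$, whence

$$\|U_{S^c}\|_{2,\infty} \le \frac{1}{k}\|U_S\|_{2,1} \le \frac{1}{\sqrt{k}}|||U_S|||_F. \qquad (72)$$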
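Therefore, we obtain

$$\begin{aligned} \phi_A(U) = \sup_{F\in A}\big(\langle\langle F_S, U_S\rangle\rangle + \langle\langle F_{S^c}, U_{S^c}\rangle\rangle\big) &\le \sup_{|||F_S|||_F \le 1}\langle\langle F_S, U_S\rangle\rangle + \sup_{\|F_{S^c}\|_{2,1} \le \sqrt{k}}\langle\langle F_{S^c}, U_{S^c}\rangle\rangle \\ &\le |||U_S|||_F + \sqrt{k}\,\|U_{S^c}\|_{2,\infty} \\ &\overset{(i)}{\le} 2|||U_S|||_F, \end{aligned}$$

where inequality (i) follows from the earlier bound (72). The claim then follows from the observation
that $\phi_B(U) = \sup_{F\in B}\langle\langle F, U\rangle\rangle = 3\max_{|T| = \lfloor k\rfloor}\sup_{|||F_T|||_F \le 1}\langle\langle F_T, U_T\rangle\rangle = 3|||U_S|||_F$.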
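
The chain of row-norm inequalities in (72) can be checked on random data. The following minimal sketch, assuming an integer $k < d$ and a Gaussian test matrix (both illustrative choices, with a hypothetical function name), returns the three quantities in increasing order.

```python
import math
import random

def row_norm_chain(d=10, r=3, k=4, seed=6):
    rng = random.Random(seed)
    U = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]
    # Row l2 norms, sorted so the first k entries index the set S.
    norms = sorted((math.sqrt(sum(x * x for x in row)) for row in U),
                   reverse=True)
    top, rest = norms[:k], norms[k:]
    sup_rest = max(rest)                          # ||U_{S^c}||_{2,inf}
    l21_top = sum(top)                            # ||U_S||_{2,1}
    fro_top = math.sqrt(sum(t * t for t in top))  # |||U_S|||_F
    return sup_rest, l21_top / k, fro_top / math.sqrt(k)

a, b, c = row_norm_chain()
print(a <= b + 1e-12, b <= c + 1e-12)
```

The first inequality holds because every row outside $S$ is no larger than the average row norm inside $S$, and the second is the Cauchy-Schwarz inequality applied to the $k$ row norms.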


E Technical lemmas for Corollary 

In this appendix, we provide the proofs of the technical lemmas required for Corollary on one-bit 
matrix completion. 


E.1 Proof of Lemma 7

In order to prove the lemma, we need to establish an upper tail bound on the random variable 


Z : = sup 
Aer(s) 


((V£„(F®F) - G(F®F), A ® F)) 


Define the indicator variable Qij := G 0} for each (i,j). 

VCni') and the definition of T{B), we observe that 


Using the expression (63) for 


Z sup M [Qijhij{Mij/a) - EQyQijhij{Mij/a)] • (A (g) 9)ij. (73) 

Uo- Aer(B) ^ 

By the usual symmetrization argument |^, the expectation Eq^y\Z] is at most a factor of two 
times the Rademacher-symmetrized version. This is the supremum of a sub-Gaussian process in 
terms of Rademacher variables, and so is majorized by the expected supremum of the corresponding 
Gaussian process up to a universal constant c. In conjunction, these two steps yield the bound 


^q,yZ < cEQ^Eg sup {^(A)} : =cEQMg sup 
Aer(B) Aer(s) 


4 

~Ba 


^ ^ dijQij^iji^ij/4 ■ G) 9)ij 
hj 


where $\{g_{ij}\}$ are independent standard Gaussian variables.

Our next step is to bound $\mathbb{E}_g\sup_{\Delta\in\Gamma(B)} Z(\Delta)$ using the Sudakov-Fernique comparison inequality.
For any $\Delta, \Delta' \in \Gamma(B)$ and $F' := F^* + \Delta'$, $M' := F'\otimes F'$, we have


7(A,A'): = E3(Z(A)-Z(A'))2 

16 




H 2 


'^dijQij {hij{Mij/a) ■ (A 0 F)ij - hij{Mya) ■ (A' 0 




16 


< 


32 


Q2, {hij{Mij/a) ■ (A 0 F),,- - h^jiM^a) ■ (A' 0 F%y 




E • (A 0 F - A' 0 F')| + • (A' 0 F')?,} . 




Recall that for each $(i,j)$ and over the interval $[-4\nu, 4\nu]$, the function $h_{ij}(\cdot)$ is surely bounded by
$4L_{4\nu}$ and $4L_{4\nu}$-Lipschitz. Moreover, the Cauchy-Schwarz inequality implies that



• 2 



= iaiy. 


I®b'l = — ll-^i-lbllE'y II 2 < 2-, 

Note that the same bound holds for |Mb|, |(A 0 F)ij\ and |(A' 0 F')ij\- It follows that 

EQ?, {(A 0 F - a' 0 F% + {M,,/a - M[^/af • 


7(A,A')<C2^ 

[B(t) 




_ (^2 

{Bay 


E Qlj {(A 0 F - A' 0 F% + (2A 0 Fij + A 0 Aij - 2A' 0 F'j - A' 0 Ah)^ 




< 




E Qb {(A 0 F - A' 0 F')?,- + (A 0 A - A' 0 A%} 

^ hi 


We compare $\mathcal{Z}(\Delta)$ with an alternative stochastic process given by

Z(A) : = C(1 + [fl'*iQ*i(A0A)jj + g[jQij{A 0 F)jj] 




where $\{g_{ij}, g'_{ij}\}$ are independent standard Gaussian variables. Both $\mathcal{Z}(\Delta)$ and $\widetilde{\mathcal{Z}}(\Delta)$ are surely
continuous in $\Delta$. Observe that by independence, we have

7(A,A'): = E,,,,(^(A)-Z(A'))2 

= C^(1 + E {(A®A - A'0A% + (A 0 F - A' 0 F%} , 




so we have $\gamma(\Delta, \Delta') \le \widetilde{\gamma}(\Delta, \Delta')$ for all $\Delta, \Delta' \in \Gamma(B)$. By the Sudakov-Fernique comparison [4^ , we find
that


EQ^yEg sup Z{A) < Eg^yEg^g/ sup Z{A) 
Aer(s) Aer(B) 

I 3 Y 


C(1 + ^)^ • ^Q,9,g' sup E [5iiQii(A0A)jj + g-jQijiA 0 F)* 
Aer(B) 






< (7(1 + z^)— ( sup |||A0A|||„,, + sup |||A 0 F|||„„<, ) EQ,g||| 5 f o Q|||„p, 


\Aer(B) 


Aer(B) 
where the last inequality follows from the generalized Hölder's inequality and the fact that $g$ and $g'$ are
identically distributed. To proceed, we use Lemma 11 to get that $\mathbb{E}_{Q,g} |\!|\!| g \circ Q |\!|\!|_{\mathrm{op}} \le c(\sqrt{pd} + \log d) \le 2c\sqrt{pd}$,
where the last inequality uses the lower bound assumed on $p$, which guarantees $\log d \le \sqrt{pd}$. Moreover, for each
$\Delta \in \Gamma(B)$, the matrices $\Delta \otimes \Delta$ and $\Delta \otimes F$ have rank at most $r$, so

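$$
\max\big\{ |\!|\!| \Delta \otimes \Delta |\!|\!|_{\mathrm{nuc}},\; |\!|\!| \Delta \otimes F |\!|\!|_{\mathrm{nuc}} \big\} \le \sqrt{r} \cdot |\!|\!| \Delta |\!|\!|_{F} \max\big\{ |\!|\!| \Delta |\!|\!|_{\mathrm{op}},\; |\!|\!| F |\!|\!|_{\mathrm{op}} \big\} \le 2\sqrt{r}\, B .
$$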
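The first inequality above combines two standard facts: a matrix of rank at most $r$ has nuclear norm at most $\sqrt{r}$ times its Frobenius norm, and the Frobenius norm of a product is at most the Frobenius norm of one factor times the operator norm of the other. The following is a minimal numerical sketch of that step. It assumes that the product $\Delta \otimes F$ is to be read as $\Delta F^\top$ (this reading, the helper name `outer`, and the random test matrices are assumptions of the example, not taken from the proof).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 50, 3

def outer(A, B):
    # Assumption: the product A (x) B is interpreted here as A @ B.T.
    return A @ B.T

# Random d x r factors standing in for Delta and F.
Delta = rng.standard_normal((d, r))
F = rng.standard_normal((d, r))

fro_delta = np.linalg.norm(Delta, "fro")
op_max = max(np.linalg.norm(Delta, 2), np.linalg.norm(F, 2))
bound = np.sqrt(r) * fro_delta * op_max

# Nuclear norms of Delta (x) Delta and Delta (x) F.
nuc_norms = [np.linalg.norm(outer(Delta, G), "nuc") for G in (Delta, F)]
print(max(nuc_norms), "<=", bound)
```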


Putting together the pieces yields 

sup Z(A) < -L 4 ,,(l + i^)^/pdr < -aSn- 

Aer(B) o 

In order to establish concentration of $Z$ around $\mathbb{E}_{Q,Y} Z$, we use a standard functional Hoeffding


inequality 44 . In particular, letting be independent random variables such that Aj takes 


values in Aj, consider a random variable of the form Y : = sup where for each g £ G, 

g&G 

have supj.g;^’. |5(3^)I ^ Then we are guaranteed that 


P [y > E[y] + r] < e for all r > 0, and 

Setting r = we have 


Aer(B) 




Aer(B) 


< C 




iv 


2^2 


a‘^B‘^ 


sup III AI 
Aer(B) 


< 


a e. 


(74) 


128 X 14 logd 


T»2. 


Consequently, applying the bound (74) with these choices of (r, Zl^),Hwe obtain ¥[Z > EQ^y[Z] + 
gOEn] < Combining with the the expectation bound Eg^yZ < ^aen, we hnd that 


Z = sup {{VCn{F0F) - GiF®F), A))/|||A|||p > ]ae^ 
Aer(S) 4 


< d 


-14 


Following the same lines of argument we obtain a similar bound on the lower tail: 


inf ((V£4F®F) - G(F®F), A))/|||A|||f < -\aen 
Aer(B) 4 


< d 


-14 


The proof of the lemma is completed by applying the union bound.


®In particular, we apply it with the following setup: Xij = (d ®CjiQij, Tj) and Xij = {ei(8 )ej}x{0,l}x{—1 ,1}, 
where d is the i-th standard basis vector in R'*; T = {Ca : A G r(i3)} with 


Ca(a,h = 


,/'(M/a) {Y,{d ® e,) - 2/(M/a) + 1) f{M/a) (2/(M7a) - 2/(M/a)) 


<r|||A||h 


/(M/a) (1 - /(M/a)) 


/(M/a) (1 - /(M/a)) 


Qij{ei (g) Cj) o (A (g) F))). 
E.2 Proof of Lemma 8

Using the Cauchy-Schwarz inequality and the expression (63) for the gradient $\nabla \mathcal{L}_n$, we have


sup 

Aero 


|((V£„(F 0 F)-G(F 0 F), A0F))\ 

IIIAIIL 


1 

< - 


sup 
Aero 


[UnhiM/a) - EUnhiM/a)] F|||p < VTs 


i=i 


where Ti : = f supAePo H i^nHM/a) - Ilnh{M*/a)] F|||f, and 

T 2 : = - sup III [Unh{M*/a)] F|||f, and Tg : = - sup ||| [Enoh(M/a)]F|||F. 

Aero 0- Aero 
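The Cauchy-Schwarz step used above moves the factor $F$ from one side of the trace inner product to the other and then bounds the result by a product of Frobenius norms: $\langle\langle A, \Delta \otimes F \rangle\rangle = \langle\langle AF, \Delta \rangle\rangle \le |\!|\!|AF|\!|\!|_F\, |\!|\!|\Delta|\!|\!|_F$. The sketch below checks this identity and bound numerically. It assumes the reading $\Delta \otimes F = \Delta F^\top$, and the generic matrix `A` is a hypothetical stand-in for the centered gradient term.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 40, 4

A = rng.standard_normal((d, d))      # stand-in for the centered gradient matrix
Delta = rng.standard_normal((d, r))
F = rng.standard_normal((d, r))

# Trace inner product <<A, Delta F^T>> (assumed meaning of Delta (x) F).
inner = float(np.sum(A * (Delta @ F.T)))

# The same quantity after moving F to the other side: <<A F, Delta>>.
moved = float(np.sum((A @ F) * Delta))

# Cauchy-Schwarz in the Frobenius inner product.
bound = np.linalg.norm(A @ F, "fro") * np.linalg.norm(Delta, "fro")
print(inner, moved, bound)
```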

To prove the lemma, it suffices to show that with probability at least 1 — each of {Ti,r 2 ,T 3 } 
is bounded from above by ^aSn- For Ti, we have 


Ti < 


2 h) 16d 

- sup d\\H{M/a) - H{M*/a)\\oo • |||-F|||op <- ■ sup ||M/ct - M*/a\\oo 

Aero <7 Aero 

16d2 ~^^(**) r log^ dL^i, 1 

< - 


where step (i) follows from the fact that $h$ is element-wise $4L_{4\nu}$-Lipschitz over $[-4\nu, 4\nu]$ and
that $|\!|\!|F|\!|\!|_{\mathrm{op}} \le 2$ for $\Delta \in \Gamma_0$, and step (ii) follows from the definition of $\nu$. Since $|\!|\!|F|\!|\!|_{F} \le 2\sqrt{r}$ for
$\Delta \in \Gamma_0$, we have


T 2 < -\lUnh{M*/a)\l^ sup 

Aero 


< ^|||noh(M7cT) 


a 


Note that for each index pair {i,j), EYhij{M*j/a) = 0, and that \hij(M*j/cr)\ < 4 L 41 , since 
ll-Tf^/crlloo < A:U- Therefore, the matrix /a) is a censored sub-Gaussian random matrix 

satisfying the assumptions in Lemma 11, by which we obtain |||nf 2 /i(M*/(T)|||op < with 

probability at least 1 — It follows that with the same probability, the second term is bounded 

as T 2 < < -^aSn- Finally, the third term T 3 can be bounded as 

T,<^ sup |||[Enoh(M/a)]|||F|||T|||op = ^ sup jp • 

O'Aero cTAero - f[M/a)) 


Note that |||T|||op < 2 for A G Fq. Moreover, because / satishes (29), we know that | < 

y/L^ and \f{x) — /(x')| < y/L^\x — x'\ for all x,x' G [—Ai 2 ,Au]. It follows that 


8pL4^ ^ SpLi^ 3 W 24V^L4j,i/ 1 

F 3 < -^ sup M - M F < -^ ^ < - < TT^aen, 

( 7 ^ Aero ^ ^ cr 12 


where step (i) follows from the definition of $\nu$. This completes the proof of the lemma.
F Proof of inequality (32)


Recalling that the matrix F* is orthonormal and ^-incoherent, we have ||oo < and 

hence ||y — < 2^. On the other hand, we claim that each row and column of the 

matrix Y — F*0F* has at most k non-zero elements. To see this, let be the set of the non-zero 
element of S*. If {i,j) 0 then \Yij\ = \{F*iSiF*)ij\ < so Yij = Yij = {F*®F*)ij and thus 
(Y — F*®F*)ij = 0. Therefore, we find that Y — F*®F* is supported on the elements in <!>*, hence 
the claim. With the above two facts, we apply Proposition 3 in the paper [20] to obtain that

l\Y - < k\\Y - F*<S)F*\\oc < 2^. 

On the other hand, the gap between the $r$-th and $(r+1)$-th singular values of the matrix $F^* \otimes F^*$
is 1. Letting $U$ be the matrix of the top-$r$ singular vectors of $Y$ and using $\Theta[\cdot,\cdot]$ to denote the
principal angles between two subspaces, we find that

min |||t7-F3||„p < v/2|||sin0[col(t7),col(F*)]|||„p < 2\lY - F*^F*\l^ < 

F*&S{M*) a 


where the first step follows from Proposition 2.2 in the paper [59] and the second step follows from
Wedin's $\sin\Theta$ theorem [30]. It follows that


d{F^,F*) <d{U,F*) min jU - F*\l^ < 

F*&£{M*) a 


where the first step holds because $F^0 = \Pi_{\mathcal{F}}(U)$ and projection onto the convex set $\mathcal{F}$ is
non-expansive, which completes the proof of the claim.
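The two ingredients of this argument, a $\sin\Theta$ perturbation bound for the top-$r$ subspace and the passage from principal angles to a distance between orthonormal bases, can be illustrated numerically. The sketch below is a toy check under stated assumptions: it reads $F^* \otimes F^*$ as $F^* F^{*\top}$, uses a small random symmetric perturbation `E` in place of the actual error matrix, aligns the bases with an orthogonal Procrustes rotation (which upper-bounds the minimum over rotations), and tests the generic bound $|\!|\!|\sin\Theta|\!|\!|_{\mathrm{op}} \le 2 |\!|\!|E|\!|\!|_{\mathrm{op}}$ for unit spectral gap rather than the exact constants of the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 60, 3

# Orthonormal F* and the rank-r matrix F* F*^T (singular values 1,...,1,0,...,0).
F_star, _ = np.linalg.qr(rng.standard_normal((d, r)))
M_star = F_star @ F_star.T

# Small symmetric perturbation (hypothetical stand-in for Y - F* (x) F*).
G = rng.standard_normal((d, d))
E = 0.01 * (G + G.T) / 2
Y = M_star + E

# Top-r left singular vectors of the perturbed matrix.
U = np.linalg.svd(Y)[0][:, :r]

# Principal angles between col(U) and col(F*): cosines are singular values of F*^T U.
A, cosines, Bt = np.linalg.svd(F_star.T @ U)
sin_theta_op = float(np.sqrt(max(0.0, 1.0 - cosines.min() ** 2)))
err_op = float(np.linalg.norm(E, 2))

# Procrustes rotation aligning F* with U, and the resulting operator-norm distance.
R = A @ Bt
dist_op = float(np.linalg.norm(U - F_star @ R, 2))

print(dist_op, np.sqrt(2) * sin_theta_op, 2 * err_op)
```

In this toy setting the printed values should be increasing from left to right, mirroring the chain of inequalities in the proof up to constants.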


G Spectral norms of censored sub-Gaussian random matrices 

In this appendix, we state and prove a useful bound on the spectral norm of sub-Gaussian random
matrices with censored entries.

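Lemma 11. Suppose $X \in \mathbb{R}^{d \times d}$ is a symmetric random matrix with $X_{ij} = g_{ij} Q_{ij}$, where $\{g_{ij} \mid i \ge j\}$
are independent zero-mean sub-Gaussian random variables with parameter 1, $\{Q_{ij} \mid i \ge j\}$ are
independent Bernoulli variables with parameter $p$, and the two families are mutually independent. Then there
exists a universal constant $c > 0$ such that

$$
\mathbb{E}\, |\!|\!|X|\!|\!|_{\mathrm{op}} \le c\big(\sqrt{pd} + \log d\big), \quad \text{and} \tag{75a}
$$

$$
\mathbb{P}\Big[ |\!|\!|X|\!|\!|_{\mathrm{op}} \ge c\big(\sqrt{pd} + \log d\big) \Big] \le d^{-12}. \tag{75b}
$$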
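As an informal illustration of the scaling in Lemma 11, one can simulate censored symmetric matrices and compare their spectral norms with $\sqrt{pd} + \log d$. The sketch below is a small Monte Carlo experiment under assumptions chosen for this example only: standard Gaussian entries (one particular sub-Gaussian family with parameter 1), the sampling rate $p = \log^2 d / d$, five repetitions per dimension, and the helper name `censored_symmetric`. It does not estimate the universal constant $c$.

```python
import numpy as np

def censored_symmetric(d, p, rng):
    """Symmetric matrix with entries g_ij * Q_ij, independent for i <= j."""
    g = rng.standard_normal((d, d))
    q = (rng.random((d, d)) < p).astype(float)
    upper = np.triu(g * q)                  # keep entries with i <= j
    return upper + np.triu(upper, k=1).T    # mirror the strict upper triangle

rng = np.random.default_rng(4)
rows = []
for d in (100, 200, 400):
    p = np.log(d) ** 2 / d                  # hypothetical sampling rate
    norms = [np.linalg.norm(censored_symmetric(d, p, rng), 2) for _ in range(5)]
    scale = np.sqrt(p * d) + np.log(d)
    rows.append((d, float(np.mean(norms)), float(scale)))

for d, mean_norm, scale in rows:
    print(f"d={d}: mean spectral norm {mean_norm:.2f}, sqrt(pd)+log d = {scale:.2f}")
```

If the lemma's scaling is visible at these sizes, the ratio of the two printed quantities should stay roughly constant as $d$ grows.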


Let us now prove this lemma. By a standard symmetrization argument, we can assume without
loss of generality that each $g_{ij}$ is a symmetric random variable. To proceed, we need the following
result from Bandeira and van Handel:

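Proposition 1 (Corollaries 3.6 and 3.12 in Bandeira and van Handel). Let $X$ be the $d \times d$ symmetric random
matrix whose entries $X_{ij}$ are independent symmetric random variables bounded by $\sigma_*$, and define
$\sigma := \max_i \sqrt{\sum_j \mathbb{E} X_{ij}^2}$. Then there exist universal constants $c_1$ and $c_2$ such that

$$
\mathbb{E}\, |\!|\!|X|\!|\!|_{\mathrm{op}} \le 3\sigma + c_1 \sigma_* \sqrt{\log d}, \quad \text{and} \tag{76a}
$$

$$
\mathbb{P}\big[ |\!|\!|X|\!|\!|_{\mathrm{op}} \ge 3\sigma + t \big] \le d \cdot \exp\Big( -\frac{t^2}{c_2 \sigma_*^2} \Big) \quad \text{for each } t \ge 0. \tag{76b}
$$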


To apply the proposition with unbounded entries in $X$, we use a standard truncation argument.
For some constant $b$ to be specified later, let $\bar{X}$ be the matrix with $\bar{X}_{ij} = X_{ij} \mathbb{1}\{|X_{ij}| \le b\sqrt{\log d}\}$. Observe
that $\bar{X}$ satisfies the assumption in Proposition 1 with $\sigma_* \le b\sqrt{\log d}$ and $\sigma \le \sqrt{pd}$. Applying
Proposition!^ with t = '/Vlb^fPlogd^ we obtain the bounds 


IE|||X|||op < + ci61ogd, and 

||X|||op > 3i/pd + \/l26^logd] < dexp < d“^^, 


(77a) 

(77b) 


where the last inequality follows from t > cr*\/12c log d. On the other hand, by choosing the constant 
b sufficiently large and using a standard bound on the maximum of sub-Gaussian variables, we know 
that 


$$
\mathbb{P}\big[ \bar{X} \ne X \big] \le \mathbb{P}\Big[ \max_{i,j} |g_{ij}| > b\sqrt{\log d} \Big] \le d^{-13}.
$$


Combining with the tail bound (|77b|) yields 

X / X 


||X|||op > 3i/pd + y/l2cb‘^ logd 


< 


+ ■ 


||X|||op > 3y/pd + Vl2cb‘^ logd 


< d 


-12 


which proves the second inequality in Lemma 11 


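Turning to the first inequality in the lemma, we let $\widehat{X}$ be the matrix with $\widehat{X}_{ij} = X_{ij} - \bar{X}_{ij} =
X_{ij} \mathbb{1}\{|X_{ij}| > b\sqrt{\log d}\}$, where $\bar{X}$ is the truncated matrix defined above, and observe that by definition,
$\mathbb{P}(0 < \max_{i,j} |\widehat{X}_{ij}| \le b\sqrt{\log d}) = 0$. Moreover, by
choosing the constant $b$ sufficiently large and using a standard concentration inequality for convex
Lipschitz functions [44], we find that for each $t \ge 0$,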


max |Xy I > by^log d + t 


< 


max \gij\ > E max \gij \ + t + 4y^logd 
i,j i,j 


“ “ d^ 


integrating these tail bounds gives E[maxjj |Xy|] < ^^^ 2 ^^ Combining with equation (77a) yields 
the upper bound 


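$$
\mathbb{E}\, |\!|\!|X|\!|\!|_{\mathrm{op}} \le \mathbb{E}\, |\!|\!|\bar{X}|\!|\!|_{\mathrm{op}} + \mathbb{E}\, |\!|\!|\widehat{X}|\!|\!|_{\mathrm{op}} \le \mathbb{E}\, |\!|\!|\bar{X}|\!|\!|_{\mathrm{op}} + d\, \mathbb{E} \max_{i,j} |\widehat{X}_{ij}| \le 3\sqrt{pd} + c_1 b \log d + \frac{c\sqrt{\log d}}{d},
$$

which completes the proof of Lemma 11.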
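The truncation argument in this proof splits $X$ into a bounded part and a remainder supported on the large entries, and then uses the triangle inequality together with the crude bound that the operator norm of a $d \times d$ matrix is at most $d$ times its largest absolute entry. The following sketch illustrates that decomposition on one simulated matrix. The Gaussian entries, the values of `d` and `p`, and in particular the small threshold constant `b = 1.0` are hypothetical choices for the example; the proof takes $b$ large, whereas a small value is used here so that the remainder is typically non-empty.

```python
import numpy as np

rng = np.random.default_rng(5)
d, p, b = 200, 0.1, 1.0   # b = 1.0 is a hypothetical, deliberately small choice

# Symmetric Gaussian matrix g and symmetric 0/1 mask q.
g = np.triu(rng.standard_normal((d, d)))
g = g + np.triu(g, k=1).T
q = np.triu((rng.random((d, d)) < p).astype(float))
q = q + np.triu(q, k=1).T
X = g * q

# Truncated part (entries at most b * sqrt(log d) in magnitude) and remainder.
thresh = b * np.sqrt(np.log(d))
X_bar = np.where(np.abs(X) <= thresh, X, 0.0)
X_hat = X - X_bar

def op(M):
    return np.linalg.norm(M, 2)

# Triangle inequality and the crude entrywise bound on the remainder.
triangle_ok = bool(op(X) <= op(X_bar) + op(X_hat) + 1e-9)
crude_ok = bool(op(X_hat) <= d * np.abs(X_hat).max() + 1e-9)
print("entries above threshold:", int(np.count_nonzero(X_hat)))
print("triangle inequality holds:", triangle_ok, "| crude bound holds:", crude_ok)
```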